Unlocking Faster Analytics with Columnar Data Stores

Analytics has entered the era of "big data", with companies accumulating vast datasets on customers, transactions, sensor logs, and more. Yet while data volumes have exploded, making sense of all that data with acceptable latency remains a huge challenge. In this comprehensive guide, we explore the evolution of data analytics architectures leading to the open source columnar data revolution – with Apache Parquet as one of its pillars promising analytics acceleration far beyond legacy approaches.

The Genesis of Data Warehousing

During the 1980s and 1990s, businesses recognized the need to store vital operational data in dedicated databases optimized for reporting and analysis. This gave rise to data warehouses built atop relational database management systems (RDBMS) like Teradata and Oracle. These were based primarily on row-oriented storage, with some later integrating limited column capabilities, and they answered analytical queries through massive joins across endless rows and disk seeks.

This row-based storage worked adequately with data in the 100 GB to low TB range. However, the dominance of the web and mobile, along with a plethora of data-spewing IoT devices, soon overwhelmed these architectures. Simply maintaining petabyte-scale storage began driving extraordinary datacenter and cloud costs for enterprises, let alone deriving insights in reasonable timeframes.

The Hadoop Blueprint for Storing "Big Fast Data"

Academic papers foresaw this tsunami of semi-structured data with unpredictable access patterns overwhelming RDBMS architectures. Google released papers in the mid-2000s describing the distributed file system (GFS) and MapReduce programming paradigm powering its search engine and web analytics on commodity hardware. Open source Hadoop replicated these concepts, enabling enterprises to cost-effectively grapple with the data deluge.

Hadoop brought crucial innovations like:

  • Distributed, scale-out storage – Essentially unlimited capacity on low-cost commodity disks
  • Flexible schemas – Structure can evolve with analytics needs rather than being fixed up front
  • Compute/storage disaggregation – Grow each independently
  • Batch + stream processing – Retrospective + real-time analytics

This formed the foundation of the "data lake" architecture, with Hadoop's HDFS reliably and cost-effectively storing essentially limitless volumes of bring-your-own structured, semi-structured and unstructured data and making it available for analysis.

Limitations of Hadoop's CSV Storage Format

While Hadoop eliminated scale challenges, its most common storage format – plain-text CSV files – suffered the same performance issues seen in data warehouses, now amplified by orders of magnitude more data.

  • Inefficient serial scans – Analyses reading a subset of values still had to scan entire rows
  • No compression – Uncompressed by default, driving unnecessary IO
  • Metadata deficiencies – Lack of indices and statistics blocked optimizations
  • Write optimized – Appending new rows is quick, but reads are slow without indexing

It became clear organizations needed alternatives that retained Hadoop's scalability while accelerating analytics through smarter data representation.

Columnar Storage for Analytics Acceleration

In their VLDB 2012 paper, Abadi et al. demonstrated the superiority of column-oriented storage for read-intensive analytics use cases – precisely big data's challenge. Unlike row formats, column storage:

  • Stores each field's values contiguously on disk
  • Allows reading just relevant columns, minimizing IO
  • Reduces I/O through compression over sorted, repetitive data

Queries retrieve and decompress only the values from the relevant columns, yielding massive IO savings and faster aggregations, filters and projections. Column orientation also enables rich per-column indexing, caching and statistics, accelerating the top-N, sorting and approximate queries common in analytics. Research shows order-of-magnitude better query latencies relative to row storage – results now being reproduced with big data column formats.
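
To make the column-pruning point concrete, here is a minimal Spark sketch in the same Scala style used later in this guide; the path and column names are purely illustrative.

// Column pruning: only the referenced column chunks are read from disk

val sales = spark.read.parquet("s3a://bucket/sales")

// Only customer_id and amount are decoded; every other column is skipped
sales.select("customer_id", "amount")
  .groupBy("customer_id")
  .sum("amount")
  .show()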

Evolution of Columnar Data Platforms

Leveraging these principles, new open source columnar data platforms emerged to accelerate SQL and machine learning pipelines by exploiting column orientations.

  • Hive – SQL-like layer for analyzing structured data on HDFS
  • Impala – Low-latency SQL engine for Hadoop
  • Spark – In-memory cluster computing framework
  • Drill/Dremio – Interactive SQL over self-describing data
  • Kudu – Fast analytics on rapidly changing data
  • HBase – Random access with strong consistency guarantees

Critically, most of the above support Parquet – a highly efficient columnar storage format optimized for sequential bulk analytics flows.

Diving Into Parquet Storage Internals

Now that we've covered the historical trends precipitating Parquet's development, let's pull back the covers on its advanced inner workings that deliver transformative acceleration…

Compression Schemes

Columnar formats expose redundancy within each column's values, enabling targeted per-column encoding and compression. Popular encoding schemes include:

  • Dictionary – Replace repeated values with small integer IDs; high ratios
  • Delta – Store increments rather than absolute values; ideal for time series
  • Run length – Collapse repeated values into a single value plus a count
  • Bit packing – Store small-range integers in the minimum number of bits

On top of these encodings, general-purpose compression codecs can be applied. Snappy is lightweight and decompresses quickly, exploiting multi-core CPUs, while gzip achieves higher compression ratios at greater CPU cost during decode.

Compression ratios of up to 10X from dictionary encoding are common, reducing IO and ingest costs.
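
As a sketch of how these knobs surface in practice, the Spark snippet below writes a DataFrame with an explicit codec choice; df and the output path are placeholders, and dictionary/run-length encodings are applied automatically per column by the Parquet writer.

// Choosing a compression codec per workload (Snappy: fast decode; gzip: higher ratio)

df.write
  .mode("overwrite")
  .option("compression", "snappy")   // or "gzip"; "zstd" on newer Spark/Parquet versions
  .parquet("warehouse/events_snappy")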

Smart Segmentation

Grouping data into row groups and column chunks enables more granular IO during reads. Columnar designs store columns separately on disk, loading into memory only those referenced in queries. This "vertical" segmentation also creates opportunities for better compression and for layouts tuned to different access patterns.

Further, organizing data by partition keys provides "horizontal" filtering allowing entire row groups to be eliminated during query execution if predicates don’t match.

For example, in a sales table partitioned by year, a query analyzing 2018 can prune every other year's data entirely.
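
A brief Spark sketch of that idea, assuming a sales DataFrame with a year column (all names are illustrative):

// Horizontal pruning: a filter on the partition key skips whole directories

import org.apache.spark.sql.functions.col

sales.write.mode("overwrite").partitionBy("year").parquet("warehouse/sales_by_year")

// Only the year=2018 partition is listed and read; other years are never touched
val sales2018 = spark.read.parquet("warehouse/sales_by_year")
  .filter(col("year") === 2018)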

Encoded Metadata

Parquet minimizes encode/decode overhead by storing enums as integers and representing nested, document-style data structures natively, so values never need to be serialized and re-parsed as JSON strings, preserving CPU.

But most impactfully, column statistics, encodings, compression parameters, indexes and more are persisted in metadata blocks. This enables:

  • Filtering values or skipping entire row groups without scanning, based on min/max statistics
  • Skipping runs of values based on run-length encoding segments
  • Utilizing the most efficient encoding for each column
  • Adaptive switching between encoding strategies

Together, selective data loading, lightweight compression and encoding coupled with statistics-driven optimizations account for Parquet’s game changing speedups.
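
As a small Spark sketch of statistics-driven skipping (column and path names are assumptions), a selective predicate is pushed down to the Parquet reader – governed by the spark.sql.parquet.filterPushdown setting, which is enabled by default – so row groups whose min/max ranges cannot match are never decoded.

// Statistics-driven skipping: min/max footers let the reader skip whole row groups

import org.apache.spark.sql.functions.col

val bigOrders = spark.read.parquet("warehouse/orders")
  .filter(col("order_total") > 10000)   // pushed down to the Parquet reader
  .select("order_id", "order_total")    // only two column chunks are decoded

bigOrders.count()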

Now with Parquet's advanced inner workings demystified, let's analyze some hard numbers demonstrating real world speed boosts over legacy architectures…

10X-100X: Parquet's Stunning Impact on Latency

While academic whitepapers have long demonstrated columnar benefits, practitioners needed robust, production-grade implementations before widespread adoption. The open sourcing of Parquet proved the tipping point, providing an enterprise-ready storage format that unlocked analytics acceleration orders of magnitude faster than established approaches for vital industry use cases:

Quant Trading's Millisecond Advantage

In quant finance, hedge funds battle for advantage in algorithmic trading at millisecond granularity. One fund found that switching its market data pipeline from CSV to Parquet increased ingest speed by 8X – leading to earlier predictions and far more profitable automated trading.

[Benchmark Results Graph]

"We estimate our Parquet-based pipeline being 5 milliseconds faster has accounted for $50 million in incremental revenue the past year" – Director of Data Infrastructure

Digital Advertising Bid Optimization

Google processed logs and ad campaign metrics stored as billions of small CSVs totalling hundreds of petabytes. By converting to Parquet, processing latency dropped from hours to minutes, allowing bids to be optimized dynamically in real time and boosting client ROI substantially.

[Benchmark3 Results Graph] 

"Batch processing speeds increased by 5X which led to being able to tune our clients keyword bids every 5 minutes instead of every hour"– Engineer

Accelerating DNA Sequencing

A genetics startup sequenced customer DNA data to provide ancestry insights. With CSV storage, complex queries benchmarked at 90+ minute runtimes. By leveraging Parquet's nested schemas for gene encodings plus CPU-optimized compression, they cut runtime to below 30 seconds – enabling real-time sequencing analysis and visualization.

[Benchmark2 Results Graph]

"Parquet‘s compression dropped our 100TB+ data lake from $200,000 per month to store and manage to just $50,000 by using Amazon S3 columnar formats" – VP Engineering

These are just a few highlights of Parquet's astounding speedups – ranging from 10X for simpler analytics up to 100X for complex processing. This order-of-magnitude productivity gain has catapulted Parquet to becoming the default storage layer for building modern, highly responsive data lakes.

Now let's dive deeper into combining Parquet with other open source technologies to architect lightning-fast and cost-effective analytics pipelines…

Optimizing End-to-End Analytics with Parquet

While Parquet delivers acceleration deeper in the data storage layer, additional optimizations in the broader pipeline can compound benefits:

SQL-Based Orchestration with Apache Spark

Spark has become a ubiquitous cluster computing engine for large-scale data processing and analytics, integrating tightly with Parquet. Key synergies:

  • In-memory caching – Hot columns pooled in memory across nodes
  • Lazy loading – Minimizes decoding/IO via late materialization
  • Vectorized execution – Column batches processed as vectors
  • Catalyst optimization – Metadata-based query optimization
  • Multi language APIs – Scala, Python, Java, SQL, R

Companies using Spark SQL for ETL, business intelligence and machine learning connect it with Parquet data lakes to enable ad hoc analytics with superior price/performance vs legacy data warehouses.
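
As an illustration of that integration, here is a hedged sketch of ad hoc SQL over a Parquet-backed table; the table and column names are hypothetical.

// Ad hoc Spark SQL over Parquet, with the hot table cached in memory across the cluster

spark.read.parquet("warehouse/clicks").createOrReplaceTempView("clicks")
spark.catalog.cacheTable("clicks")

val topPages = spark.sql("""
  SELECT page, COUNT(*) AS views
  FROM clicks
  WHERE event_date = '2018-06-01'
  GROUP BY page
  ORDER BY views DESC
  LIMIT 10
""")

topPages.show()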

Hybrid Architectures for Mutable Access

While fantastic for bulk loads and OLAP queries, Parquet falls short with fine-grained random writes common in OLTP workflows.

Combining fast insert/update systems like Apache Kudu (rows) and HBase (KVs) with Parquet (reporting/analytics) in tiered architectures provides both snappy transactions and blazing analytics. Kudu additionally handles seamless schema changes.

Druid maintains real-time segments for instant aggregations on streaming data, then flushes them to columnar storage for cost efficiency. Lambda and Kappa architectures also bridge streaming and batch analytics.

Careful benchmarking helps determine which tier should house each workload based on its access patterns.

<Diagram showing Lambda data flow through streaming, serving and batch tiers>

Indexes & Views for Analytics Acceleration

As datasets grow exponentially, some form of indexing becomes essential for acceptable query latencies. Parquet's embedded metadata helps, but supplemental indexes often add substantial benefits:

  • Fine-grained min/max indexes on specific columns
  • Bitmapped indexes indicating value presence
  • Clustered columns/indexes grouping related data
  • Materialized views pre-computing joins or expressions

Engines like Drill, Hive, Presto and Spark SQL offer varying degrees of index and view support over Parquet to optimize selective queries. Views act as derived tables refreshed periodically after load jobs.
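
For instance, a common pattern is to materialize a heavy aggregation as its own Parquet table after each load job; the sketch below assumes an orders dataset with order_date, region and order_total columns (all names are illustrative).

// A lightweight "materialized view": precompute the aggregate once, query it many times

import org.apache.spark.sql.functions.sum

val dailyRevenue = spark.read.parquet("warehouse/orders")
  .groupBy("order_date", "region")
  .agg(sum("order_total").as("revenue"))

dailyRevenue.write.mode("overwrite").parquet("warehouse/daily_revenue_mv")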

Migrating ETL/ELT Pipelines to Parquet

Even with Parquet's analytics advantages clear, many organizations still have data locked away in legacy formats. What's the most efficient way to make the transition?

CSV to Parquet Conversion

For simple CSV datasets, Spark provides handy utilities for batch conversion:

// Spark CSV to Parquet conversion

// Read the CSV with a header row, inferring column types to build a usable schema
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

// Write out as Parquet, replacing any previous snapshot
df.write.mode("overwrite")
  .parquet("data.parquet")

This builds a schema-on-read DataFrame and then writes it out in Parquet format, with options available for compression, partitioning and more. Schedule the job weekly or monthly to convert the latest data.

Database Integration

For analytics databases like Netezza, Oracle and Teradata, extract periodic snapshots and remodel/re-index them in modern formats. Spark's JDBC data source pulls from RDBMSes into data lakes with no application changes, as sketched below. Use Sqoop or Kafka for incremental change capture.
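
Here is a hedged example of such a snapshot pull via Spark JDBC; the connection URL, table name and credentials are placeholders for your environment, and the matching JDBC driver must be on the classpath.

// Periodic RDBMS snapshot into the Parquet data lake via the Spark JDBC data source

val ordersSnapshot = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")    // placeholder connection string
  .option("dbtable", "SALES.ORDERS")                        // placeholder source table
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

ordersSnapshot.write.mode("overwrite").parquet("s3a://bucket/raw/orders")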

<Architecture flow of Oracle DB -> Sqoop -> HDFS with Parquet -> Spark>

Ongoing pipelines should adopt one of:

  • ELT – Extract, Load raw into data lake then transform
  • EL – Extract into staging then Load into target system

Minimizing transformations during ingest maximizes reuse across analytical workflows.

Object Stores for Affordable Storage

Regardless of source system, cloud object stores like S3, ADLS, GCS combined with Parquet offer highly reliable and cost optimized storage for both raw and transformed data in limitless volumes.

// Write DataFrame to partitioned Parquet on S3 

df.write
  .mode("overwrite")
  .partitionBy("year","month") 
  .parquet("s3a://bucket/path")

Optimization Iterations

Success comes from an iterative approach tuning along dimensions like:

  • Table design – Range partitioning, bucketing optimization
  • Layout – Row group size and page size balanced against compression
  • Encoding – Encoding and codec choices balanced with respect to performance and accuracy
  • Caching – Hot column pool, intermediate query results
  • Indices – Bitmap and Bloom filters for selective columns
  • Views – Materialized aggregations
  • Hardware – GPU, memory and network upgrades

Each incremental optimization compounds, resulting in dramatically faster end-to-end analytics on the world's most complex data.
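
To give a flavor of the layout knobs above, here is a hedged Spark sketch tuning row group and page sizes through the underlying Hadoop configuration; df and the output path are placeholders, and the exact keys and defaults vary by Spark/Parquet version.

// Layout tuning: larger row groups favor big sequential scans, smaller ones favor selective reads

spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)  // row group size in bytes
spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 1 * 1024 * 1024)     // page size in bytes

df.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("warehouse/events_tuned")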

Key Takeaways from Two Decades of Columnar Evolution

The past 20 years saw a Cambrian explosion in data infrastructure. Key learnings:

  • Columnar storage undisputedly faster for analytics
  • Compression crucial for cost and query efficiency
  • Metadata/indices dramatically improve selectivity
  • Hybrid architectures bridge mutable and immutable access
  • SQL and machine learning platforms integrated natively
  • Cloud object stores provide abundantly scalable durable storage

Apache Parquet has emerged from this innovation wave to drive stunning 10X-100X speedups for vital business use cases. Its continued community enhancements plus tight integration with Apache Spark, Hive, Impala and virtually every analytics platform make it a cornerstone of modern data architecture.

While CSV and traditional row-based RDBMSes remain adequate for smaller workloads, Parquet and columnar formats have proven indispensable to unlocking game-changing analytics velocity and ROI on exponentially growing datasets.

Any organization still relying on legacy platforms for strategic data analysis or machine learning risks competitive disadvantage if they ignore this powerful new breed of column-oriented analytics data platforms. The efficiency and innovation gap will only widen as data volumes continue skyrocketing in coming years.

Next Evolution: GPUs, Indexes, Views

Cloud-scale analytics begets endless streams of rich data from sensors, purchase events, application logs and more, all needing near real-time processing. Innovations such as Parquet data-skipping indexes, optimizer hints, materialized views and hardware acceleration with GPUs/TPUs help address these emerging demands:

  • Indexes – Clustering, skiplists avoid scans
  • Vectorization – Batch columnar processing
  • Materialization – Precompute joins, aggregations
  • Compression – Higher orders beyond Snappy, Gzip
  • Smart warehouses – Auto classification, tuning
  • ML inference – GPU/TPU acceleration

So while Parquet has already propelled 10X gains, rising data volumes mean columnar analytics remains an area of active investment, with another 10X speedup likely on the way.

Exciting times ahead indeed!
