Amazon Redshift Spectrum
Amazon Redshift is AWS’s data-warehousing tool; its Spectrum feature queries data in S3 directly. Redshift has been usefully benchmarked against the three main contenders for pole position in this cogent research (updated annually) by George at Fivetran (“a data pipeline that syncs data from apps, databases and file stores into customers’ data warehouses”). Whilst they found no warehouse was dramatically differentiated from the others, there are elements of Redshift which make it stand out for particular applications & contexts:
- Useful blend of managed vs fully serverless operation (with BigQuery being the only truly serverless option)
- Redshift Spectrum, which lets you query data in S3 directly without first loading it into the warehouse
Over the last two years, the major cloud data warehouses have been in a near-tie for performance. Redshift and BigQuery have both evolved their user experience to be more similar to Snowflake. The market is converging around two key principles: separation of compute and storage, and flat-rate pricing that can “spike” to handle intermittent workloads.
Later, specifically on Redshift and “RA3” (announced in this video):
Pre-RA3 Redshift is somewhat more fully managed, but still requires the user to configure individual compute clusters with a fixed amount of memory, compute and storage. Redshift RA3 brings Redshift closer to the user experience of Snowflake by separating compute from storage.
I can’t work out what the acronym RA3 stands for, but the nodes are built on the AWS Nitro System and feature high-bandwidth networking and performance “indistinguishable from bare metal”, according to this source.
The “getting started” sequence for Redshift has a few notable errors:
- filename date2008_pipe.txt instead of the advised date_pipe.txt
- sales_tab.txt is tab-delimited (“\t”) rather than pipe-delimited (“|”) as advised
- date format “D/M/YYYY hh:mm:ss” instead of the advised “YYYY-MM-DD hh:mm:ss”; I resolved this in Google Sheets with the formula below, which rewrites the date part of each timestamp in column J and keeps the time component, skipping blank rows (an alternative COPY-level fix is sketched after the formula):
=filter(
  arrayformula(
    if(
      j:j="",,text(datevalue(j:j),"YYYY-MM-DD ")&right(j:j,8)
    )
  ),
a:a<>"")
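For completeness, the same three quirks can instead be absorbed in the COPY commands themselves, without editing the source files. A minimal sketch, assuming TICKIT-style date and sales tables; the bucket path and IAM role ARN are placeholders. Note that whether timeformat 'auto' correctly disambiguates day-first D/M/YYYY dates is worth verifying, which may make the Sheets rewrite the safer fix:

-- Load the date dimension: point at the actual filename and keep the pipe delimiter.
copy date
from 's3://<your-bucket>/tickit/date2008_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
delimiter '|'
region 'us-west-2';

-- Load sales: the file is tab-delimited, and timeformat 'auto' asks Redshift
-- to recognise the timestamp format itself rather than assuming YYYY-MM-DD.
copy sales
from 's3://<your-bucket>/tickit/sales_tab.txt'
iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
delimiter '\t'
timeformat 'auto'
region 'us-west-2';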
The “getting started” sequence for Redshift Spectrum has a notable error of its own:
- the count query returns 172456 rows instead of the advised 172462
But it works well enough to demonstrate the chief best practice: “keep your larger fact tables in Amazon S3 and your smaller dimension tables in Amazon Redshift”.
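To make that concrete, here is a minimal sketch of the pattern, modelled on the Spectrum getting-started sequence; the schema and database names, S3 location, IAM role ARN, and abbreviated column lists are placeholders:

-- Register an external schema backed by the data catalog.
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

-- The large fact table stays in S3 as an external table.
create external table spectrum.sales(
    salesid   integer,
    eventid   integer,
    pricepaid decimal(8,2),
    saletime  timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://<your-bucket>/tickit/spectrum/sales/';

-- The small dimension table lives locally in Redshift.
create table event(
    eventid   integer,
    eventname varchar(200));

-- The row-count check (where the tutorial advises 172462):
select count(*) from spectrum.sales;

-- Queries then join the S3-resident fact table to the local dimension table.
select e.eventname, sum(s.pricepaid) as revenue
from spectrum.sales s
join event e on s.eventid = e.eventid
group by e.eventname
order by revenue desc;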