Logos of Terraform, AWS Athena and Apache Spark

Outline

Athena is a managed PrestoDB offering from AWS; it primarily serves the purpose of query execution and data exploration against databases (namespaces) predefined with Hive DDL. For a quick dive into Athena we’ll use Terraform to provision the resources needed for a basic functional setup, then process and convert a dimensional-like data set from CSV to Parquet, create and apply DDL statements on top of our data, run some analytical queries, and finally look at further recommendations and other common use cases.


Contents:

1. Prerequisites:

  • Terraform v0.12.20 or newer with AWS provider v2.49.0 or newer
  • awscli, Python 3.7 and Apache Spark 2.4.x
  • An IAM role with an access key pair for programmatic access; the role should be granted the AmazonS3FullAccess and AmazonAthenaFullAccess permission policies
  • You can clone the files directly from here
├── csv-parquet.py
├── exec-query.py
├── infra
│   ├── athena.tf
│   ├── bucket.tf
│   └── vars.tf
├── data
│   ├── ..
└── queries
    ├── ..

2. Infra

Here, we’ll simply declare a few resources: an Athena workgroup, an S3 bucket for our data, and an Athena database, including the minimum of SSE.

Workgroups are aimed at separating users, teams, applications, or workloads; enforcing cost limits; and tracking query-related metrics, which can be published to CloudWatch, among other handy things.

Other than that, it’s hard to figure out the purpose of the named query feature. The one and only candidate consumer seems to be the AWS console UI for Athena, which probably needs to store users’ queries and opened tabs somewhere between sessions. A weird feature really; one should ask the business people.

Point shared_credentials_file at the credentials profile of your Athena IAM role, and execute the commands.

cd infra/
terraform init
terraform apply -var 'tag=<your-tag>' -var 'aws_region=<your-region>'

athena.tf

provider "aws" {
  region                  = var.aws_region
  shared_credentials_file = "~/.aws/credentials"
  profile                 = "Athena"
}

resource "aws_athena_workgroup" "this" {
  name = "athena_test"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.this.bucket}/outputs/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
  tags = {
    Name = var.tag
  }
}

resource "aws_athena_database" "this" {
  name   = "test_db"
  bucket = aws_s3_bucket.this.bucket

  encryption_configuration {
    encryption_option = "SSE_S3"
  }
  depends_on     = [aws_s3_bucket.this]
  // !For easy terraform destroy only
  force_destroy  = true
  tags = {
    Name = var.tag
  }
}

bucket.tf

resource "aws_s3_bucket" "this" {
  bucket = "athena-showcase"
  region = var.aws_region
  // AES256 on Server Side
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
  // !For easy terraform destroy only
  force_destroy = true
  tags = {
    Name = var.tag
  }
}

vars.tf

variable "tag" {
  type    = string
  default = "athena-test"
}

variable "aws_region" {
  type    = string
  default = "eu-central-1"
}

3. Dataset processing

Athena is designed to scan big data sets, so testing it with a handful of records wouldn’t really feel like running a query. The AWS open data sets registry has some decent ones to choose from. The chosen data set has a fact (cube) file with ~80 million rows; the compressed archive is ~1GB, ~6GB once inflated, and ~1.6GB once converted to Parquet with Snappy compression. With an average bandwidth of ~50 Mbps and 8GB of RAM, the get-convert-put should take ~15-20 minutes. Since Parquet is a column-oriented storage format, you can take a look at how the resulting files are structured (see the quick inspection sketch after the commands below).

curl --request GET -L \
     --url 'http://data.gdeltproject.org/events/GDELT.MASTERREDUCEDV2.1979-2013.zip'\
     --output 'data.zip'

unzip data.zip -d data
rm data.zip
./csv-parquet.py
aws s3 cp data/gdelt.masterreducedv2 s3://athena-showcase/gdelt.masterreducedv2/ --recursive --profile Athena
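
To peek at that column-oriented structure, here is a minimal sketch using pyarrow (assuming pyarrow is installed; the part-file name below is a placeholder, pick any part-*.snappy.parquet file under data/gdelt.masterreducedv2/):

#!/usr/bin/env python3.7
# Sketch: inspect the layout of one generated Parquet file with pyarrow.
# The part-file name is a placeholder -- use any part-*.snappy.parquet from the output directory.
import pyarrow.parquet as pq

pf = pq.ParquetFile('data/gdelt.masterreducedv2/part-00000.snappy.parquet')

print(pf.schema)                 # column names with physical/logical types
print(pf.metadata)               # number of row groups, total rows, created_by
print(pf.metadata.row_group(0))  # per-column sizes, encodings and statistics of the first row group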

Considering the purpose again, Athena is really a query tool only; it is capable of CTAS and INSERT INTO statements, but these operations have limitations when it comes to partitioning. Hence the management of transformations and of the entire data catalog falls outside its competencies, which makes its deployment a one-time static declaration.

The widely adopted approach for processing larger data sets is to run Spark applications on an EMR cluster and write the results to S3. For demonstration purposes, twenty lines of Python can do as much as:

  • Create a Spark App
  • Read & preview (head-tail) the CSV’s into dataframes
  • Write the Hive DDL based on the dataframes’s inferred types
  • Write the data out in parquet format and finally remove the original files
#!/usr/bin/env python3.7
import os

from pyspark.sql import SparkSession

# Driver memory must be set before the session is created; setting it afterwards has no effect.
spark = SparkSession.builder \
    .appName('CSV to Parquet') \
    .config('spark.driver.memory', '8g') \
    .getOrCreate()

data_dir = os.getcwd() + '/data/'
files = [data_dir + file for file in os.listdir(data_dir)]

for file in files:
    print(file)
    df = spark.read.csv(file, inferSchema=True, header=True)
    df.show(n=10)
    df.printSchema()
    # Build the Hive DDL from the inferred dataframe schema
    columns = ', \n'.join([f'{col[0].lower()} {col[1]}' for col in df.dtypes])
    table_name = os.path.basename(file).replace('8277.csv', '').lower()
    hive_ddl = f'CREATE EXTERNAL TABLE {table_name} (\n{columns})' \
               ' \nSTORED AS PARQUET ' \
               f'\nLOCATION \'s3://athena-showcase/{table_name}/\';'
    with open(file=os.getcwd() + '/queries/' + table_name + '.ddl.sql', mode='w') as ddl:
        ddl.write(hive_ddl)
    df.write.parquet(path=data_dir + table_name, compression='snappy')
    os.remove(file)

4. Hive DDL + DML

So once the files reside in the bucket, it’s time to define the schema on top of them. I experimented with the AWS Glue crawler for around 30 minutes and had mixed results with the DDL it produced; try it yourself, I don’t wish to impose my biased opinion.

Another thing that comes as a bit of a surprise is that column & table names in Hive have to be lowercase and cannot contain white space (ideally just underscores), which is more of an end-user/cosmetic issue for data modelling, but hey, there you go, one less convention to think of when designing your data catalog.

Also beware that you’re not going to receive many useful SQL error messages when interacting with Athena. The DDL may execute fine, then you’ll run your query and realize that the target table is empty; there are no checks on the underlying S3 data, so if you’ve missed the correct S3 location path, too bad, you’re on your own to spot these details.

CREATE EXTERNAL TABLE IF NOT EXISTS test_db.gdelt_masterreducedv2 (
date int, 
source string,
...
actiongeolong double) 
STORED AS PARQUET 
LOCATION 's3://athena-showcase/gdelt.masterreducedv2';

You can execute the DDL from the Athena console, or programmatically, see exec-query.py sketched below.

To list the files:

aws s3 ls s3://athena-showcase/ --recursive --human-readable --profile Athena

If you also find yourself having issues with the AWS user interface, the good news is ;] that you can print the contents of a result file from S3 to your terminal’s stdout, just like:

aws s3 cp s3://athena-showcase/outputs/xyz.key - --profile Athena
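
The repo tree above lists an exec-query.py; here is a minimal sketch of what such a helper could look like with boto3 (not necessarily the repository’s exact script; the yearly-count query assumes the date column holds YYYYMMDD integers, and the region is taken from the Athena profile):

#!/usr/bin/env python3.7
# Sketch: submit a query to Athena programmatically via boto3 and wait for it to finish.
import time
import boto3

session = boto3.Session(profile_name='Athena')
athena = session.client('athena')

# Hypothetical example query: yearly event counts, assuming `date` holds YYYYMMDD integers.
sql = 'SELECT "date" / 10000 AS year, COUNT(*) AS events ' \
      'FROM gdelt_masterreducedv2 GROUP BY "date" / 10000 ORDER BY year'

# The workgroup enforces its own result location and SSE_S3 encryption,
# so no ResultConfiguration is passed here.
execution = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': 'test_db'},
    WorkGroup='athena_test',
)
query_id = execution['QueryExecutionId']

# Poll until the query reaches a terminal state, then print where the results landed in S3.
while True:
    query = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']
    state = query['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

print(state)
print(query['ResultConfiguration']['OutputLocation'])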

5. Taking it further:

  • Since S3 is commonly used for log shipping from various other AWS services (ELB, CloudFront, RDS, billing, etc.), it is a natural fit for log inspection and analysis.

  • Should you decide to serve analytical data marts via Athena, the first and most obvious challenge is how to accommodate a robust ETL/ingestion process and make use of key partitioning, so that we don’t end up scanning an entire, indefinitely growing table on every query (unless the size is tiny); see the sketch after this list.

  • If you wish to store & query star-schema-like analytical models, be sure to research your storage format options well, as well as the nature of your dimensions; consider a scenario where your fact table has to combine events that happen on different timelines, shipment and order for instance. Updates on files are not an option here, so one ideally has to think of append-only designs.

  • Beware of empty and quoted values/columns in delimited (CSV) files; it’s smoother to start with Avro, Parquet or JSON formats.

  • One of the ideas behind this technology is to give our users (analysts) fast access to a vast amount of data, or in other words to make data sets easily discoverable. With power comes responsibility, so should you embark on building a robust Hive catalog, consider having a data access strategy from an early point (beware that row/attribute-level access policies cannot be utilised).

  • The AWS user interface offered with Athena is (no offence) one I’d rather not look at too much; JDBC or ODBC client drivers are available, and the instructions on how to set up the driver and connect are concise. For a correct build of the JDBC connection URL, take a look at page 14 of the guide. For our case here, the following one did the job: jdbc:awsathena://athena.<region>.amazonaws.com:443;User=<aws_access_key_id>;Password=<aws_secret_access_key>;S3OutputLocation=s3://<bucket>/<result>;S3OutputEncOption=SSE_S3;

  • From a production-grade setup point of view, our source buckets should of course be private, and Athena should be reached through a private VPC (interface) endpoint, unless our users are connecting through our organisation’s VPN network.
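
On the partitioning point above, a minimal PySpark sketch of the idea (the year partition key, the input path and the gdelt_partitioned prefix are illustrative assumptions, not part of the original setup):

#!/usr/bin/env python3.7
# Sketch of key partitioning: derive a partition key during the Spark conversion and
# write one directory per value, which maps 1:1 to Hive partitions.
# The `year` key, input path and output prefix below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Partitioned ingest').getOrCreate()

df = spark.read.csv('data/some_fact_table.csv', inferSchema=True, header=True)

# Assuming the date column holds YYYYMMDD integers, derive a year partition key from it.
df = df.withColumn('year', (F.col('date') / 10000).cast('int'))

# Produces data/gdelt_partitioned/year=1979/, year=1980/, ... ready to be copied to S3 as before.
df.write.partitionBy('year').parquet(path='data/gdelt_partitioned', compression='snappy')

# The matching DDL then declares PARTITIONED BY (year int) instead of listing year as a regular
# column, and partitions are registered with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION,
# so queries filtered on year only scan the matching S3 prefixes.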

Opinion

Athena is a vital part of the AWS analytics stack; however, when considering it as a candidate for, let’s say, data discovery purposes, your audience should be SQL savvy and prepared to work with the PrestoDB functions reference, as well as to face ambiguities not present in traditional database systems. Since this is the paradigm of separating storage and compute resources, our data is only as good and useful as we’ve preprocessed it.