AWS Neptune walkthrough
Hands-on labs and tutorials for Neptune are fewer and further between than those for other AWS services, so I thought I’d have a go at writing one.
Useful resources:
What’s a cluster?
Neptune is a graph database built to handle billions of relationships and to answer queries with millisecond latency. It runs on a cluster model: one primary (writer) instance and up to 15 read replicas*, ranked in a configured “pecking order” (failover priority) that decides which replica is promoted if the primary fails. Clusters can be accessed via the usual tools:
- AWS Console in browser (steps listed below)
- scripts
- CLI
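To make the “scripts” option concrete, here is a minimal sketch using boto3, the AWS SDK for Python. It assumes boto3 is installed and credentials are already configured; the function names are my own, not part of any AWS API. It lists each Neptune cluster in a Region with its members, writer first and then readers in failover order. It reads a single page of results, which should be enough for a small account.

```python
from typing import Any, Dict, List, Tuple

Row = Tuple[str, str, int]  # (instance id, role, promotion tier)


def summarize_members(cluster: Dict[str, Any]) -> List[Row]:
    """Turn one entry of the "DBClusters" list into sorted member rows."""
    rows = []
    for member in cluster.get("DBClusterMembers", []):
        role = "writer" if member["IsClusterWriter"] else "reader"
        rows.append((member["DBInstanceIdentifier"], role, member["PromotionTier"]))
    # Writer first, then readers by promotion tier (lower tiers are promoted first).
    return sorted(rows, key=lambda r: (r[1] != "writer", r[2], r[0]))


def list_neptune_clusters(region_name: str) -> Dict[str, List[Row]]:
    """Fetch Neptune clusters in a Region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_members works without it

    client = boto3.client("neptune", region_name=region_name)
    response = client.describe_db_clusters(
        Filters=[{"Name": "engine", "Values": ["neptune"]}]
    )
    return {
        c["DBClusterIdentifier"]: summarize_members(c)
        for c in response["DBClusters"]
    }
```

Calling `list_neptune_clusters("eu-west-1")` (the Region is just an example) returns a dictionary keyed by cluster identifier, which is roughly the same information as the nested view in the Console.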
Console steps
- Log in at aws.amazon.com and select a Region.
- Pick Neptune from the Services dropdown, then Databases from the left sidebar/drawer.
- A database is a cluster of a writer & readers; these are shown in a hierarchical nested view with useful metrics incl. status & CPU.
- Using the orange Create Database button, explore the basic config including Engine Version (Engine Releases documented here), DB Cluster Identifier (must be unique within your account and Region), Templates, DB Instance Size (resembling the classic EC2 instance sizes, such as db.t3.medium at roughly 10 cents per hour), Availability & Durability (incl. AZ preference), and Connectivity (I’ve yet to need an option other than “Default VPC”, but your security boundary requirements may be more advanced).
- Additional Configuration is also available, incl. DB Instance Identifier, DB Cluster Parameter Group, DB Parameter Group, IAM DB Authentication, Failover Priority, backup & encryption options (retention periods of 1–35 days, and key management) and audit logging (e.g. to CloudWatch).
- Maintenance, maintenance window and deletion protection sit towards the end of Additional Configuration. You may find them particularly useful, but they need a detailed understanding to configure safely & cost-effectively, so please refer to the docs before using them.
- Whilst waiting for creation to complete (Status will change from Creating to Available), create a notebook instance for analysis of the cluster: select Notebooks from the left sidebar/drawer; hit Create Notebook; select an instance type (as above, the default, ml.t3.medium, has served me well). These are of course SageMaker notebooks behind the scenes (so you can also open them via the SageMaker service in the Services dropdown); they are powered by Jupyter and are sometimes referred to as “workbenches”.
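The Console steps above can also be sketched as a script. This is a minimal illustration with boto3, not a production recipe: the helper names and default values are my own choices, and it assumes your account has a default VPC and DB subnet group, since no networking options are passed. The two `build_*` helpers only assemble request parameters, so they can be inspected without touching AWS.

```python
from typing import Any, Dict


def build_cluster_request(cluster_id: str, retention_days: int = 7) -> Dict[str, Any]:
    """Keyword arguments for the Neptune create_db_cluster call."""
    if not 1 <= retention_days <= 35:
        raise ValueError("backup retention must be between 1 and 35 days")
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "neptune",
        "BackupRetentionPeriod": retention_days,
        "StorageEncrypted": True,
        "EnableIAMDatabaseAuthentication": True,
        # Publishes audit logs to CloudWatch; as I understand it, audit logging
        # itself is switched on separately via a cluster parameter.
        "EnableCloudwatchLogsExports": ["audit"],
        "DeletionProtection": True,
    }


def build_instance_request(
    cluster_id: str,
    instance_id: str,
    instance_class: str = "db.t3.medium",
    failover_priority: int = 1,
) -> Dict[str, Any]:
    """Keyword arguments for the Neptune create_db_instance call."""
    if not 0 <= failover_priority <= 15:
        raise ValueError("failover priority (promotion tier) must be 0-15")
    return {
        "DBInstanceIdentifier": instance_id,
        "DBClusterIdentifier": cluster_id,
        "Engine": "neptune",
        "DBInstanceClass": instance_class,
        "PromotionTier": failover_priority,
    }


def create_cluster(region_name: str, cluster_id: str, instance_id: str) -> None:
    """Create a cluster with one instance (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders work without it

    client = boto3.client("neptune", region_name=region_name)
    client.create_db_cluster(**build_cluster_request(cluster_id))
    client.create_db_instance(**build_instance_request(cluster_id, instance_id))
    # Blocks until the instance status reaches "available".
    client.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)
```

Note that `create_cluster` creates billable resources, and with deletion protection enabled you will need to switch that off again before the cluster can be deleted.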
Every cluster has a default parameter group. One of its parameters is neptune_lab_mode, used for enabling experimental features which are not recommended for use in Production; these have included e.g. Neptune Streams… for the full list/pipeline please visit the link at the top of this article. Another is neptune_query_timeout, which I have found useful in projects to date.
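As a rough sketch of how neptune_query_timeout might be changed from a script: default parameter groups can't be edited, so the usual route is to create a custom cluster parameter group, set the value there, and attach that group to the cluster. The function names below are hypothetical, the timeout is given in milliseconds, and the parameter group family (something like "neptune1", depending on engine version) is an assumption you'd need to check against your own cluster.

```python
from typing import Dict


def build_timeout_parameter(timeout_ms: int) -> Dict[str, str]:
    """Parameter entry that sets neptune_query_timeout (milliseconds)."""
    if timeout_ms <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(timeout_ms),
        # "pending-reboot" is the conservative choice: applied on next reboot.
        "ApplyMethod": "pending-reboot",
    }


def set_query_timeout(
    region_name: str, group_name: str, family: str, timeout_ms: int
) -> None:
    """Create a custom cluster parameter group and set the query timeout.

    Needs boto3 and AWS credentials. `family` must match the engine version
    of the cluster the group will be attached to.
    """
    import boto3  # imported lazily so build_timeout_parameter works without it

    client = boto3.client("neptune", region_name=region_name)
    client.create_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        DBParameterGroupFamily=family,
        Description="Custom Neptune cluster parameters",
    )
    client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        Parameters=[build_timeout_parameter(timeout_ms)],
    )
```

The new group still has to be selected for the cluster (the DB Cluster Parameter Group option under Additional Configuration) before the setting takes effect.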
*The 15 is a limit found consistently across AWS, with Aurora (another database service) also maxing out at 15 “Aurora Replicas”; I expect this to rise in 2021 and will link to product announcements as they are published.