Preface

Planning on moving a lot of the project's data into one or more graph databases. Primary reasons:

allows for effective querying according to relationships, and there will be a lot of this, eg: location tree, catalogs, production attributes, following actions performed by a user, buyer habits etc
elegant methods of versioning data: almost all data in the project will be immutable, so any data that can change needs to be versioned
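
A minimal sketch of the versioning idea, assuming an in-memory toy graph rather than a real graph database. The class name VersionedGraph, the "previous_version" edge label and the "entity#vN" id scheme are hypothetical, chosen only for illustration. A change never edits an existing vertex; it appends a new version vertex and links it back to the one it replaces.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class VersionedGraph:
    """Toy property graph where every change appends a new version vertex."""

    vertices: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    # Each edge is (from_vertex_id, edge_label, to_vertex_id).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def create(self, entity_id: str, props: Dict[str, Any]) -> str:
        """Store version 1 of an entity and return its vertex id."""
        if any(v["entity"] == entity_id for v in self.vertices.values()):
            raise ValueError(f"entity already exists: {entity_id}")
        return self._add_version(entity_id, 1, props)

    def update(self, entity_id: str, changes: Dict[str, Any]) -> str:
        """Append a new version; earlier version vertices are left untouched."""
        current = self.latest(entity_id)
        merged = {**current["props"], **changes}
        new_id = self._add_version(entity_id, current["version"] + 1, merged)
        # Hypothetical edge label linking a version to the one it replaces.
        self.edges.append((new_id, "previous_version", current["id"]))
        return new_id

    def latest(self, entity_id: str) -> Dict[str, Any]:
        """Return the highest-numbered version vertex of an entity."""
        versions = [v for v in self.vertices.values() if v["entity"] == entity_id]
        if not versions:
            raise KeyError(entity_id)
        return max(versions, key=lambda v: v["version"])

    def _add_version(self, entity_id: str, version: int, props: Dict[str, Any]) -> str:
        vertex_id = f"{entity_id}#v{version}"
        self.vertices[vertex_id] = {
            "id": vertex_id,
            "entity": entity_id,
            "version": version,
            "props": dict(props),  # copy so callers cannot mutate stored data
        }
        return vertex_id
```

Reading history is then a traversal along the "previous_version" edges starting from the latest version.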

One huge graph, per service, or per stack?

One Project Graph containing all graphed data

Pros:

ready for any type of relationship query we want to perform

Cons:

all graph databases seem to have some form of scaling issue, eg neo4j does not index edges, so if one vertex has many edges the modelling starts to fall apart
Neptune specifically seems to struggle with large datasets and has a limit on the number of edge labels/properties, which is difficult to reconcile with an ever more complex huge project graph
lack of compartmentalization, easier for mistakes to affect other areas of the project
one point of failure

One graph database per service

Pros:

compartmentalized
smaller data sets

Cons:

lose a lot of the gains of a graph database, eg a rich relationship network

Multi service graphs

Planning on using this middle ground, where a logical set of data will be stored in one graph that is shared by multiple services, often per stack.

Plan on keeping the structures standard (eg a user object is the same type of object across all graphs, and common edge labels are kept the same) so we can easily patch these graphs together into a larger graph if needed for relationship querying.
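
A minimal sketch of patching graphs together, assuming each graph is exported as a plain dict of vertices keyed by a globally shared id plus a set of (from, label, to) edges. The function name patch_graphs, the dict layout and the sample ids are all hypothetical. It only works because the vertex ids and labels are kept standard across graphs, so it refuses to merge when the same id carries different labels.

```python
from typing import Any, Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (from_vertex_id, edge_label, to_vertex_id)
Graph = Dict[str, Any]       # {"vertices": {id: {"label", "props"}}, "edges": set of Edge}


def patch_graphs(graphs: Iterable[Graph]) -> Graph:
    """Union several graphs that share vertex ids and label conventions."""
    vertices: Dict[str, Dict[str, Any]] = {}
    edges: Set[Edge] = set()
    for graph in graphs:
        for vertex_id, vertex in graph["vertices"].items():
            existing = vertices.get(vertex_id)
            if existing is not None and existing["label"] != vertex["label"]:
                # Same id but different type: the graphs are not standardized.
                raise ValueError(f"label mismatch for vertex {vertex_id}")
            props = dict(existing["props"]) if existing else {}
            props.update(vertex["props"])
            vertices[vertex_id] = {"label": vertex["label"], "props": props}
        edges |= set(graph["edges"])
    return {"vertices": vertices, "edges": edges}


# Two hypothetical per-stack graphs that both know about the same user vertex.
catalog_graph = {
    "vertices": {
        "user:1": {"label": "user", "props": {"name": "Ann"}},
        "product:9": {"label": "product", "props": {"title": "Kettle"}},
    },
    "edges": {("user:1", "viewed", "product:9")},
}
location_graph = {
    "vertices": {
        "user:1": {"label": "user", "props": {}},
        "location:3": {"label": "location", "props": {"name": "Oslo"}},
    },
    "edges": {("user:1", "lives_in", "location:3")},
}

project_graph = patch_graphs([catalog_graph, location_graph])
```

After patching, a relationship query such as "products viewed by users living in a location" can run over project_graph even though neither source graph holds both edge types.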

Service that manages a graph

When a graph is shared across multiple services (and most will likely have that possibility, so probably all graphs), deploy the graph behind its own separate service. Consider making this service standard/generic, with config settings for each implementation.

This service could have settings such as which data elements can be edited and which cannot, with a Lambda that checks this, securing immutable data.
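
A minimal sketch of such a check, assuming the per-graph settings are a mapping from vertex label to the set of properties that may be edited. GRAPH_CONFIG, the event shape and the function names are hypothetical; the handler(event, context) signature follows the usual Python Lambda convention, but nothing here calls AWS. Labels missing from the config are treated as fully immutable.

```python
from typing import Any, Dict, List, Set

# Hypothetical per-graph settings: which properties of each vertex label are editable.
GRAPH_CONFIG: Dict[str, Dict[str, Set[str]]] = {
    "user": {"editable": {"display_name", "email"}},
    "order": {"editable": set()},  # orders are fully immutable
}


def rejected_properties(
    config: Dict[str, Dict[str, Set[str]]], label: str, changes: Dict[str, Any]
) -> List[str]:
    """Return the changed properties that the config does not allow editing."""
    editable = config.get(label, {}).get("editable", set())
    return sorted(name for name in changes if name not in editable)


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Lambda-style entry point: allow an update only if every property is editable."""
    rejected = rejected_properties(GRAPH_CONFIG, event["label"], event["changes"])
    return {"allowed": not rejected, "rejected": rejected}
```

Because the rules live in config rather than code, the same generic service could be deployed per graph with a different GRAPH_CONFIG.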

Could standardize tasks like updating versioned data of different types, see: 2021-02-22 - Maintaining change history using graph database

Which graph database

I like the design of neo4j more than Neptune: it does not have the edge label limitations and stores relationships per vertex for effective traversal. Have also seen some reports of Neptune falling apart on traversals of multiple hops on large datasets. For now use Neptune, as it is easy (and cheap?) to set up and maintain; by using the same query language as neo4j we can migrate later.

Property Graph

Comparison with RDF

Property Graph was chosen over RDF because connectivity with external services is not a priority, and RDF can still be extracted from a Property Graph if needed. The expected design will have a lot of relationships (predicates) that themselves have properties. Also, querying and visualizing the project's objects as a Property Graph seems simpler than with RDF.

Decision to change from Neptune to neo4j

Neptune's development appears lackluster
Code samples and help with issues are limited; code samples are often out of date or have errors
Many reports of slow queries or queries that do not run to completion
Difficulty in connecting to Neptune: the JavaScript gremlin library does not support the SigV4 signing required by Neptune IAM. Gave up trying to connect with IAM enabled (tried within the VPC, via Node.js and Python, with all the route table/security group options I could think of)
Not confident about Neptune's indexing, prefer neo4j's node traversal model
No way to create local database for testing
Chose neo4j because it seems the most active and developed

Optimizing data modelling for Neptune

Neptune's indexing favors vertex ids, edge labels, and edge ids, so try to design the model so that queries can bound these
edge label seems more important than the edge id for optimization, but giving both is best (now thinking edge id would be efficient too)
Neptune does not like a lot of edge labels, wants at most hundreds. Keep to standard label names to map relationships. However, the more results returned per edge label, the more post-query filtering is needed, so need to balance the number of labels against the number of results per vertex.
distinct edge labels also include properties of vertices and edges, see forum thread below
because of Neptune's limits on edge labels + properties(?), deciding not to have one huge graph for the entire project; instead break into smaller graphs according to logical groupings and expected relationship queries.
if our queries always include the property name or edge label then the queries are still efficient; it is when a query has open-ended reverse(?) traversals that this is an issue, see forum thread below
for our use case, where we will often query one vertex and find relationships from there, neo4j might be more effective as it stores relationships per vertex efficiently, whereas Neptune stores indexes of vertex-edge-vertex types. Initially use Neptune for ease of management.
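
A toy model of the point about bounding queries by edge label, assuming a simple in-memory index. It is not Neptune's actual storage or API; the class LabelIndexedEdges and the probe counter are hypothetical and only illustrate why a reverse lookup that names the edge label is cheap, while an open-ended one has to try every label in the graph.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class LabelIndexedEdges:
    """Edges indexed by (source, label) and by (label, target), nothing else."""

    def __init__(self) -> None:
        self.by_source_label: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.by_label_target: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.labels: Set[str] = set()

    def add_edge(self, source: str, label: str, target: str) -> None:
        self.by_source_label[(source, label)].add(target)
        self.by_label_target[(label, target)].add(source)
        self.labels.add(label)

    def outgoing(self, source: str, label: str) -> Set[str]:
        """Forward lookup bounded by vertex id and edge label: one index probe."""
        return set(self.by_source_label.get((source, label), set()))

    def incoming(self, target: str, label: Optional[str] = None) -> Tuple[Set[str], int]:
        """Reverse lookup; returns (sources, number of index probes used)."""
        if label is not None:
            # Bounded by label: a single probe of the (label, target) index.
            return set(self.by_label_target.get((label, target), set())), 1
        # Open-ended: no label given, so every known label has to be probed.
        sources: Set[str] = set()
        for known_label in self.labels:
            sources |= self.by_label_target.get((known_label, target), set())
        return sources, len(self.labels)
```

In this toy, the cost of the open-ended reverse lookup grows with the number of distinct labels, which is one way to read the advice above to keep label counts low and to always name the label in queries.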

References

https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
https://forums.aws.amazon.com/thread.jspa?threadID=336321

2021-02-25 - Graph database ideas
