2021-02-25 - Graph database ideas
Preface
Planning to move a lot of the project's data into one or more graph databases. Primary reasons:
- allows effective querying by relationship, and there will be a lot of this, eg: location tree, catalogs, production attributes, following the actions performed by a user, buyer habits etc
- elegant methods of versioning data. Almost all data in the project will be immutable, so any data that can change needs to be versioned
One huge graph, per service, or per stack?
One Project Graph containing all graphed data
Pros:
- ready for any type of relationship query we want to perform
Cons:
- all graph databases seem to have some form of scaling issue, eg neo4j does not index edges, so if one vertex has many edges the modelling starts to fall apart
- Neptune specifically seems to struggle with large datasets and has a limit on the number of edge labels/properties, which is difficult to reconcile with a huge, ever more complex project graph
- lack of compartmentalization, easier for mistakes to affect other areas of the project
- single point of failure
One graph database per service
Pros:
- compartmentalized
- smaller data sets
Cons:
- Lose a lot of the gains of a graph database, eg rich relationship network
Multi service graphs
Planning on using this middle ground, where a logical set of data will be stored in one graph that is shared by multiple services, often per stack.
Plan on keeping the structures standard (eg a user object is the same type of object across all graphs, and common edge labels are kept the same) so these graphs can easily be patched together into a larger graph if needed for relationship querying.
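A minimal sketch of what "patching together" could look like, assuming vertex ids are globally unique across graphs and vertex/edge labels follow the shared standard. The names Vertex, Edge, Graph and patch are hypothetical and only model the idea in memory; they are not tied to any graph database API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    id: str     # assumed globally unique, shared across graphs
    label: str  # standard type name, e.g. "user"


@dataclass(frozen=True)
class Edge:
    out_id: str  # source vertex id
    label: str   # standard edge label, e.g. "purchased"
    in_id: str   # target vertex id


@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)  # id -> Vertex
    edges: set = field(default_factory=set)       # set of Edge

    def add_edge(self, out_v: Vertex, label: str, in_v: Vertex) -> None:
        self.vertices[out_v.id] = out_v
        self.vertices[in_v.id] = in_v
        self.edges.add(Edge(out_v.id, label, in_v.id))


def patch(*graphs: Graph) -> Graph:
    """Union several graphs; vertices with the same id are treated as one."""
    merged = Graph()
    for g in graphs:
        for v in g.vertices.values():
            existing = merged.vertices.setdefault(v.id, v)
            if existing.label != v.label:
                # Only possible if a graph broke the shared type standard.
                raise ValueError(f"vertex {v.id} has conflicting types")
        merged.edges |= g.edges
    return merged


# Two per-stack graphs that both know about the same user.
alice = Vertex("user:1", "user")
catalog_graph = Graph()
catalog_graph.add_edge(alice, "owns", Vertex("catalog:7", "catalog"))
orders_graph = Graph()
orders_graph.add_edge(alice, "purchased", Vertex("product:3", "product"))

combined = patch(catalog_graph, orders_graph)
alice_edge_labels = {e.label for e in combined.edges if e.out_id == "user:1"}
```

The merge is only this simple because both graphs agree on ids and labels; without that standard, patching would need a mapping step per graph.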
Service that manages a graph
When a graph is shared across multiple services (most will likely have that possibility, so probably all graphs), deploy the graph behind its own separate service. Consider making this service standard/generic, with config settings for each implementation.
This service could have settings such as which data elements can be edited and which cannot, with a Lambda that checks this, securing immutable data.
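A rough sketch of such a check, assuming a per-graph config that lists which properties may be edited for each vertex type. GRAPH_CONFIG, check_update and the event shape are all hypothetical; only the handler(event, context) signature follows the usual Lambda convention.

```python
# Hypothetical per-graph config: which properties may change after creation.
GRAPH_CONFIG = {
    "user": {"mutable": {"display_name", "email"}},
    "order": {"mutable": set()},  # orders are fully immutable
}


def check_update(vertex_label: str, changes: dict) -> list:
    """Return the property names in `changes` that must not be edited."""
    rules = GRAPH_CONFIG.get(vertex_label)
    if rules is None:
        # Unknown types are rejected wholesale rather than silently allowed.
        return sorted(changes)
    return sorted(set(changes) - rules["mutable"])


def handler(event: dict, context: object = None) -> dict:
    """Lambda-style entry point; the event shape is an assumption."""
    rejected = check_update(event["label"], event["changes"])
    return {"allowed": not rejected, "rejected": rejected}


ok = handler({"label": "user", "changes": {"email": "a@example.com"}})
blocked = handler({"label": "order", "changes": {"total": 5}})
```

Rejecting unknown vertex types by default keeps a missing config entry from turning into an accidental write path.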
Could standardize tasks like updating versioned data of different types (see 2021-02-22 - Maintaining change history using graph database).
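As an illustration of a generic versioned update, here is one common pattern: never edit a stored version, always append a new immutable snapshot and treat the latest as current. This is an assumption for illustration and not necessarily the approach in the 2021-02-22 note; Version and VersionedEntity are hypothetical names, modelled in memory rather than in a graph database.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class Version:
    number: int
    properties: MappingProxyType  # read-only snapshot of the entity's state


@dataclass
class VersionedEntity:
    id: str
    versions: list = field(default_factory=list)  # oldest first

    @property
    def current(self):
        return self.versions[-1] if self.versions else None

    def update(self, **changes) -> Version:
        """Append a new version; earlier versions are never modified."""
        base = dict(self.current.properties) if self.current else {}
        base.update(changes)
        version = Version(len(self.versions) + 1, MappingProxyType(base))
        self.versions.append(version)
        return version


product = VersionedEntity("product:3")
product.update(name="Widget", price=10)
product.update(price=12)
```

In a graph, each Version would plausibly be its own vertex linked from the entity vertex, which is what makes the update routine the same for every data type.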
Which graph database
I like the design of neo4j more than Neptune: it does not have the edge label limitations and it stores relationships per vertex for effective traversal. I have also seen some reports of Neptune falling apart on multi-hop traversals of large datasets. For now, use Neptune as it is easy (and cheap?) to set up and maintain; as long as we use a query language that neo4j also supports, we can migrate later.
Property Graph
Comparison with RDF
Property Graph was chosen over RDF because connectivity with external services is not a priority, and RDF can be extracted from a Property Graph if needed. The expected design will have a lot of relationships (predicates) that themselves have properties. Querying and visualizing the project's objects as a Property Graph also seems simpler than with RDF.
Decision to change from Neptune to neo4j
- Neptune's development appears lackluster
- Code samples and help with issues are limited, code samples are often out of date or have errors
- Many reports of slow queries or queries that do not run to completion
- Difficulty in connecting to Neptune: the JavaScript Gremlin library does not support the SigV4 signing required by Neptune IAM. Gave up trying to connect with IAM enabled (tried within the VPC, via Node.js and Python, with all the route table/security group options I could think of)
- Not confident about Neptune's indexing, prefer neo4j's node traversal model
- No way to create local database for testing
- Chose neo4j because it seems the most active and developed
Optimizing data modelling for Neptune
- Neptune's indexing favors vertex ids, edge labels, and edge ids, so try to design the model so that queries can bind these
- edge label seems more important than the edge id for optimization, but giving both is best (now thinking edge id would be efficient too)
- Neptune does not like a lot of edge labels, it wants hundreds at most. Keep to standard label names to map relationships. The more results that get returned per edge label, the more post-query filtering needs to be done, so need to balance the number of labels against the number of results per vertex.
- distinct edge labels also include properties of vertices and edges, see forum thread below
- because of Neptune's limits on edge labels + properties(?), deciding not to have one huge graph for the entire project, and instead to break it into smaller graphs according to logical groupings and expected relationship queries.
- if our queries always include the property name or edge label then the queries are still efficient. It is when a query has open-ended reverse(?) traversals that this becomes an issue, see forum thread below
- for our use case, where we will often be querying one vertex and finding relationships from there, neo4j might be more effective as it stores relationships per vertex efficiently, whereas Neptune stores indexes of vertex-edge-vertex types. Initially use Neptune for ease of management.
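The point about bounding queries by edge label can be illustrated with a toy model. This is a simplification for intuition only, not Neptune's actual storage: it assumes edges are kept in two sorted orderings (source-first and label-first) and that a lookup is a prefix scan over one of them. All names here are hypothetical.

```python
from bisect import bisect_left

# Edges as (source, label, target) triples.
EDGES = [
    ("user:1", "purchased", "product:3"),
    ("user:2", "purchased", "product:3"),
    ("user:2", "viewed", "product:3"),
    ("user:1", "located_in", "location:9"),
]

# Two sorted orderings standing in for a source-first and a label-first index.
BY_SOURCE = sorted(EDGES)                         # (source, label, target)
BY_LABEL = sorted((l, t, s) for s, l, t in EDGES)  # (label, target, source)


def prefix_scan(index, prefix):
    """Return all rows of a sorted index that start with `prefix`."""
    rows = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break
        rows.append(row)
    return rows


def outgoing(source, label=None):
    """Targets reachable from `source`: always a single prefix scan."""
    prefix = (source, label) if label else (source,)
    return [row[2] for row in prefix_scan(BY_SOURCE, prefix)]


def incoming(target, label=None):
    """Sources of edges into `target`, plus how many scans were needed."""
    # Without a label there is no usable prefix, so every label is tried.
    labels = [label] if label else sorted({row[0] for row in BY_LABEL})
    sources = []
    for l in labels:
        sources += [row[2] for row in prefix_scan(BY_LABEL, (l, target))]
    return sorted(set(sources)), len(labels)


bounded = incoming("product:3", "purchased")
open_ended = incoming("product:3")
```

In this model the open-ended reverse lookup costs one scan per distinct label, which is one way to read both the advice to always bind the edge label and the pressure to keep the number of labels small.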
References
https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
https://forums.aws.amazon.com/thread.jspa?threadID=336321