Is BrightstarDB a Single-Server System?

Mar 2, 2014 at 7:25 AM
I was trying to find documentation about replication, but there seems to be no information anywhere. I wanted to know: is this strictly an embedded or single-server service? Do you plan to add replication or a multi-server data storage system to make it truly scalable, without apps having to be re-designed?
Coordinator
Mar 2, 2014 at 8:30 AM
Currently B* is a single-server / embedded system. We have plans and some initial code for sharding and replication, but it is not ready for release just yet... which is why there is no documentation about that :-)
Mar 5, 2014 at 11:32 AM
Do you have any public documentation about the planned semantics of replication/sharding in future B* releases, where developers might provide feedback to help define this important new functionality?
Coordinator
Mar 7, 2014 at 1:55 PM
It definitely makes sense to share our thinking. There is some code from a first attempt, but I have been looking around at other FOSS projects that manage replication and sharding and that has started to influence my thinking about how B* should handle it. I'll try and write something up - maybe as a document in the github repository. I want to focus on getting the 1.6 release out first though. I'll post here when the doc is written...
Coordinator
Mar 13, 2014 at 3:29 PM
I've created an issue for this on GitHub - https://github.com/BrightstarDB/BrightstarDB/issues/86
Mar 14, 2014 at 8:36 AM
An interesting view on the scalability issues of triple stores:
here

I've added it here since it's kind of related to the topic.
Coordinator
Mar 18, 2014 at 11:32 AM
feugen24 wrote:
An interesting view on the scalability issues of triple stores:
here

I've added it here since it's kind of related to the topic.
That is interesting. It adds one more thing to think about for scaling: in addition to replication and sharding as approaches to horizontal scaling, we could/should also consider federated search, as that would allow you to create partitions as separate stores but still query across some or all of them when necessary.
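To make that a bit more concrete, here is a very rough scatter-gather sketch. The IPartitionStore interface is purely illustrative (it is not part of the current BrightstarDB API), and a real federated query layer would also have to handle joins that span partitions; this only shows the simplest case of running the same query on each partition and unioning the rows.

```csharp
// Illustrative only: IPartitionStore is a hypothetical interface, not a BrightstarDB API.
using System.Collections.Generic;
using System.Linq;

public interface IPartitionStore
{
    // Runs a SPARQL SELECT against this partition and returns its result rows.
    IEnumerable<IDictionary<string, string>> ExecuteSelect(string sparql);
}

public static class FederatedQuery
{
    // Naive scatter-gather: run the same query on every partition and union the rows.
    public static IEnumerable<IDictionary<string, string>> Run(
        IEnumerable<IPartitionStore> partitions, string sparql)
    {
        return partitions.SelectMany(p => p.ExecuteSelect(sparql));
    }
}
```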
Coordinator
Mar 18, 2014 at 11:34 AM
OK, I have put together thoughts on replication. I have some more scribbles about sharding but I think replication is the first step to tackle and I want to keep discussion focussed on what is practical to get done in the medium term. It is all here: https://github.com/BrightstarDB/BrightstarDB/wiki/BrightstarDB-Replication. Please use this thread for discussion / comments.

Thanks!
Mar 20, 2014 at 9:07 PM
Edited Mar 20, 2014 at 9:55 PM
Another video worth watching... about Virtuoso 7. It speaks in general terms, but it is worth seeing what their thoughts on scaling are.
here

Related to that document:
  • I would pick "Replicate transactions"; it seems by far the better choice.
The primary responds with the transaction data for the next transaction in the sequence (or should we allow the primary to send multiple transactions if the secondary is falling behind?)
  • If it sends multiple transactions, then it could be extended in the future to be configurable (one or more transactions per pull) or dynamic depending on factors such as network speed. I'm not sure how much extra effort that would require compared with sending a single transaction's data.
It should also be possible to request that a write be pushed to the replicas and optionally confirmed (question: should this be on an operation-by-operation basis or should it be part of the replica set configuration?). In this scenario when the primary has committed its operation it will notify the replicas of the operation and optionally wait for some number of replicas to confirm that they have applied the transaction before releasing the write-lock on its own local store.
  • If I have this mode I can't do synchronous writes from UI calls, since the end user would have to wait until the slaves confirm. It would work, though, in scenarios where the UI makes async calls and the caller doesn't care when they finish. The problem is that if I set this for the whole cluster then I can only make async calls from the UI... so setting it per operation seems more appropriate.
Question: Is replication without failover of any practical use - i.e. is it something worth releasing?
  • This depends on how tightly coupled this part is with B* (see below).
Some thoughts not specific to the B* scaling discussion:
I'm not too familiar with replication/cluster implementation details, but what comes to mind is that this is a common scenario that each database tries to solve in a way that is coupled to the database itself (probably because of performance, vendor lock-in, etc.).
In essence the scenario is a communication protocol between the nodes of a graph so that the data gets distributed, but the problem is largely independent of the actual data/database.
For example, if the API allows me to create a node that is not aware of any particular database, and which links to a database through a source adapter and receives transactions (changesets) as messages, then I could write a destination adapter (for a destination node) and replicate that data into any database (relational, document, graph, etc.) and then query it in a different data format (similar to the event sourcing/streaming idea).
Of course this kind of destination node could not replace the master.
This is why "Replicate transactions" seems better.
It seems there is an implementation of this idea for relational stores: http://www.symmetricds.org/
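A rough sketch of what I mean (all type names here are hypothetical, nothing is an existing BrightstarDB API): a source adapter exposes committed transactions as an ordered stream of changeset messages, and a destination adapter applies them to whatever store it likes.

```csharp
// Hypothetical sketch: decoupling replication from the storage engine.
using System.Collections.Generic;

// A changeset message: the triples added and removed by one committed transaction.
public class ChangesetMessage
{
    public long SequenceNumber { get; set; }
    public List<string> TriplesAdded { get; } = new List<string>();
    public List<string> TriplesRemoved { get; } = new List<string>();
}

// Source adapter: reads committed transactions from the master store and
// exposes them as an ordered stream of changeset messages.
public interface ISourceAdapter
{
    IEnumerable<ChangesetMessage> ReadFrom(long afterSequenceNumber);
}

// Destination adapter: applies changesets to any backing store (relational,
// document, graph, ...). Such a replica only consumes the stream and could
// not take over as master.
public interface IDestinationAdapter
{
    void Apply(ChangesetMessage changeset);
}

public static class ReplicationPump
{
    // Pulls changesets from the source and pushes them to a destination,
    // returning the sequence number of the last changeset applied.
    public static long Pump(ISourceAdapter source, IDestinationAdapter destination, long lastApplied)
    {
        foreach (var changeset in source.ReadFrom(lastApplied))
        {
            destination.Apply(changeset);
            lastApplied = changeset.SequenceNumber;
        }
        return lastApplied;
    }
}
```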
Coordinator
Mar 25, 2014 at 11:15 AM
Sorry for the delay in replying!

feugen24 wrote:
Another video worth watching... about Virtuoso 7. It speaks in general terms, but it is worth seeing what their thoughts on scaling are.
here

Related to that document:
  • I would pick "Replicate transactions"; it seems by far the better choice.
Yep, I think that overall replicating transactions is the way to go, for lots of reasons.
The primary responds with the transaction data for the next transaction in the sequence (or should we allow the primary to send multiple transactions if the secondary is falling behind?)
  • If it sends multiple transactions, then it could be extended in the future to be configurable (one or more transactions per pull) or dynamic depending on factors such as network speed. I'm not sure how much extra effort that would require compared with sending a single transaction's data.
It will add some complexity but it could be really useful, especially if the workload is lots of small transactions. I think being dynamic rather than configurable would be best, though that adds more complexity again. As you say, maybe start with a protocol design that allows the server to provide multiple transactions as a payload, begin with a really simple approach (such as only sending one transaction at a time), and leave it open to be extended with configurable/dynamic options in the future.
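Just to sketch the shape of the protocol messages I have in mind (hypothetical types, nothing that exists in the codebase yet): the request carries the last transaction the secondary has applied plus the batch size it is willing to accept, and the response carries zero or more transactions in commit order.

```csharp
// Hypothetical pull-protocol message shapes; not an existing BrightstarDB API.
using System.Collections.Generic;

public class PullRequest
{
    // The last transaction the secondary has applied.
    public long LastAppliedTxnId { get; set; }
    // How many transactions the secondary will accept in one response.
    // Start with 1; a later version could make this configurable or dynamic.
    public int MaxTransactions { get; set; } = 1;
}

public class ReplicatedTransaction
{
    public long TxnId { get; set; }
    public byte[] TransactionData { get; set; }
}

public class PullResponse
{
    // Zero or more transactions following LastAppliedTxnId, in commit order.
    public List<ReplicatedTransaction> Transactions { get; } = new List<ReplicatedTransaction>();
    // True if the primary has more transactions beyond this batch,
    // so the secondary knows to pull again immediately.
    public bool MoreAvailable { get; set; }
}
```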
It should also be possible to request that a write be pushed to the replicas and optionally confirmed (question: should this be on an operation-by-operation basis or should it be part of the replica set configuration?). In this scenario when the primary has committed its operation it will notify the replicas of the operation and optionally wait for some number of replicas to confirm that they have applied the transaction before releasing the write-lock on its own local store.
  • If I have this mode I can't do synchronous writes from UI calls, since the end user would have to wait until the slaves confirm. It would work, though, in scenarios where the UI makes async calls and the caller doesn't care when they finish. The problem is that if I set this for the whole cluster then I can only make async calls from the UI... so setting it per operation seems more appropriate.
Yes, the only downside to that is that it introduces replication concerns into the client-server protocol. But as you say there are scenarios in which you might want to mix the way writes are handled on a client-by-client basis.
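As a sketch (hypothetical names again), the per-operation settings might be as small as this; the trade-off is the one noted above, namely that these options become part of the client-server protocol rather than purely replica-set configuration.

```csharp
// Hypothetical per-operation write settings; not an existing BrightstarDB API.
public enum WriteDurability
{
    // Return as soon as the primary has committed (suits synchronous UI calls).
    PrimaryOnly,
    // Primary commits, then waits for MinConfirmations replicas to confirm
    // applying the transaction before releasing its write lock.
    ConfirmedByReplicas
}

public class WriteOptions
{
    public WriteDurability Durability { get; set; } = WriteDurability.PrimaryOnly;
    // Only relevant when Durability == ConfirmedByReplicas.
    public int MinConfirmations { get; set; } = 1;
}
```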
Question: Is replication without failover of any practical use - i.e. is it something worth releasing?
  • This depends on how tightly coupled this part is with B* (see below).
Some thoughts not specific to the B* scaling discussion:
I'm not too familiar with replication/cluster implementation details, but what comes to mind is that this is a common scenario that each database tries to solve in a way that is coupled to the database itself (probably because of performance, vendor lock-in, etc.).
In essence the scenario is a communication protocol between the nodes of a graph so that the data gets distributed, but the problem is largely independent of the actual data/database.
For example, if the API allows me to create a node that is not aware of any particular database, and which links to a database through a source adapter and receives transactions (changesets) as messages, then I could write a destination adapter (for a destination node) and replicate that data into any database (relational, document, graph, etc.) and then query it in a different data format (similar to the event sourcing/streaming idea).
Of course this kind of destination node could not replace the master.
This is why "Replicate transactions" seems better.
It seems there is an implementation of this idea for relational stores: http://www.symmetricds.org/
I hadn't thought of this, but it does make sense.
Jun 12, 2014 at 7:19 PM
Question: would the idea be to include native support for easily connecting a mobile client to a server for an online/offline type of access pattern?
Coordinator
Jun 17, 2014 at 8:48 AM
It would be a nice idea to have that support. I guess that to be practical, though, it would require some support for data synchronization. With our transaction logs we mostly have the ability to list what triples were added/deleted on a store from a given point in time (I say mostly because currently that is not true when you import a file; we just store the file path). I guess there would also need to be some sort of additional metadata (maybe just another operation in the transaction log) that tells you when you last synchronized with a particular store.
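To sketch what I mean (hypothetical types only, not the current transaction log format): each log entry would record the triples added and removed, and a marker entry would record when we last synchronized with a particular peer store, so the next sync only needs the entries after that point.

```csharp
// Hypothetical sketch of synchronization metadata; not the current
// BrightstarDB transaction log format.
using System;
using System.Collections.Generic;

public class TxnLogEntry
{
    public long TxnId { get; set; }
    public DateTime CommitTime { get; set; }
    public List<string> TriplesAdded { get; } = new List<string>();
    public List<string> TriplesRemoved { get; } = new List<string>();
}

// Recorded after a successful synchronization (maybe just as another operation
// in the transaction log), so the next sync can start from LastSyncedTxnId.
public class SyncMarker
{
    public string PeerStoreId { get; set; }
    public long LastSyncedTxnId { get; set; }
    public DateTime SyncedAt { get; set; }
}
```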

If we were to add data synchronization to our roadmap, would people rate that as more important or less important than replication/clustering ?
Jun 17, 2014 at 10:20 AM
Good question. I think I would go with data synchronization being of higher value, for a few reasons. The main one is that with data sync comes the ability to handle the online/offline scenarios that are common in mobile application development. That said, I would think that data sync support would form a good foundation for replication/clustering where only changed data is replicated?
Jul 15, 2014 at 8:17 PM
Question: any thoughts on how this fits into the roadmap and a possible version release target? I am curious because I am in the process of starting a new project that has a hard requirement for transparent online/offline operation with sync support (a mobile field application used in unpopulated areas with poor cell connectivity). If there are plans to include some sort of replication in the next 3 to 6 months, then BrightstarDB is by far my number one choice.
Coordinator
Jul 20, 2014 at 3:50 PM
Its not really on any roadmap with dates attached at the moment. I would hope to get around to being able to do something about this during the next 6 months, but it really depends on other time pressures (both BrightstarDB related and otherwise).