Status: In Progress
Affects Version/s: BIGDATA_RELEASE_1_5_0
Fix Version/s: None
Component/s: Bigdata Federation
- Break apart the large configuration file (this goes hand in hand with dropping the SMS).
- Simplify the configuration of each service.
- Provide a runstate control program for each service (basic start/stop/status).
- Document failover practices (this involves a leader election in which services contend for failover).
- Provide a deployer of the services onto a cluster.
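As a rough illustration of the leader election mentioned in the failover item above, here is a minimal in-memory sketch. The `Election` class, its `join`/`leave`/`leader` methods, and the service names are hypothetical and are not part of the Bigdata code base. A real deployment would presumably rely on an external coordination service rather than in-process state. The sketch only shows the contention rule: the earliest surviving contender is the leader.

```python
import itertools


class Election:
    """In-memory stand-in for a coordination service (hypothetical)."""

    def __init__(self):
        self._ticket = itertools.count()
        self._queue = {}  # service name -> ticket number

    def join(self, service):
        """Enter the election; earlier joiners rank ahead of later ones."""
        if service not in self._queue:
            self._queue[service] = next(self._ticket)

    def leave(self, service):
        """Withdraw, e.g. because the service stopped or its session died."""
        self._queue.pop(service, None)

    def leader(self):
        """Return the contender with the lowest ticket, or None if empty."""
        if not self._queue:
            return None
        return min(self._queue, key=self._queue.get)


election = Election()
for name in ("ds-a", "ds-b", "ds-c"):  # hypothetical service instances
    election.join(name)

first_leader = election.leader()   # the first service to join leads
election.leave("ds-a")             # simulate failure of the leader
second_leader = election.leader()  # the next contender takes over
```

The remaining contenders do not need to renegotiate when the leader disappears; the ordering established at join time already determines who fails over.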
There are several services:
- ServiceManagerServer (SMS). This attempts to automate the service starts on the different nodes and provide some "flex" configuration capabilities. A lot of people have asked for a simpler mechanism based on configuring an image for a given service type and then using whatever mechanism (Puppet, etc.) to deploy those services.
- TransactionServer (TXS). This service maintains alternating snapshots of the global commit time index. This index provides an ordered set of the retained commit times. This information could also be recovered from the DS nodes. I believe the main reason the TXS is important is the read retention policy: open read-only transactions are tracked by the TXS, and the TXS will not permit DS nodes to release commit state on which those transactions are reading.
- DataServer (DS). Scale-out has a lot of DS tweaks. Most of these are around bulk load performance through the streaming index writer.
- MetadataServer (MDS). Not that much tweaking is required here. Failover is identical to the DS handling.
- LoadBalancerServer (LBS). This is used primarily for load-balancing shards based on write hot spots. It also aggregates performance counters from the cluster. The performance counter aggregation support should be removed, since we can use tools like Ganglia for that and tracking a lot of performance counters becomes an unscalable memory burden on the LBS. The hot-spot decision making should eventually be modified to use an external performance counter tracking platform, but in the short term we can keep the relatively low-overhead performance counter collection and reporting mechanisms and the LBS can continue to use them. The LBS can write an interesting event log, but it does not have any durable data that needs to be restart safe.
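The read retention rule described for the TXS above can be sketched as follows. This is a simplified model, not the actual TXS logic: the function names are hypothetical, commit times are plain integers, and any minimum retention age is ignored. The assumption is that the oldest open read-only transaction pins the latest commit point at or before its read time, and only commit state strictly older than that pinned point may be released.

```python
import bisect


def pinned_commit_time(commit_times, open_read_times):
    """Return the oldest commit time that must be retained.

    commit_times: retained commit times, sorted ascending.
    open_read_times: read times of the open read-only transactions.
    With no open readers, only the most recent commit point is pinned
    (assumption: no minimum retention age is configured).
    """
    if not commit_times:
        return None
    if not open_read_times:
        return commit_times[-1]
    oldest_reader = min(open_read_times)
    # Latest commit point at or before the oldest reader's read time.
    i = bisect.bisect_right(commit_times, oldest_reader) - 1
    return commit_times[max(i, 0)]


def releasable_commit_times(commit_times, open_read_times):
    """Commit times whose state the DS nodes could release."""
    pinned = pinned_commit_time(commit_times, open_read_times)
    if pinned is None:
        return []
    return [t for t in commit_times if t < pinned]


commits = [100, 200, 300, 400]  # hypothetical commit time index
with_readers = releasable_commit_times(commits, [250, 400])
without_readers = releasable_commit_times(commits, [])
```

In this example a reader at time 250 is reading from the commit point at 200, so only the state for commit time 100 is releasable while that transaction stays open.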