When consuming partitions, we do a scan on an indexed table. This scan
- hardcodes the HBase scanner client cache to 1000
- scans all 1000 items and performs a Get for each of them - until the working set is full.
1000 Gets normally take only about 10 seconds, which is well below the typical timeout values. However, when HBase is under heavy load or responds slowly for other reasons (for example, slow HDFS, slow network, major compaction, HDFS rebalance), this can easily take minutes and exceed the HBase RPC timeout, the scanner timeout, and the transaction timeout.
Reducing the working set can mitigate this (it limits the number of Gets), but that is not acceptable in all scenarios.
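To put rough numbers on the timing described above, here is a small back-of-the-envelope sketch. The 10 ms per Get follows from "1000 Gets in about 10 seconds"; the 100 ms slow-case latency and the 60 s and 30 s timeouts are assumed values for illustration only, not read from any actual configuration. The class and method names are hypothetical.

```java
public class ScanBudget {

  /** Total time spent on one scanner batch when every row needs one Get. */
  static long batchMillis(int rowsPerBatch, long millisPerGet) {
    return rowsPerBatch * millisPerGet;
  }

  /** Largest per-Get latency for which a full batch does not exceed the timeout. */
  static long maxMillisPerGet(int rowsPerBatch, long timeoutMillis) {
    return timeoutMillis / rowsPerBatch;
  }

  public static void main(String[] args) {
    int rows = 1000;                 // hardcoded scanner cache size from this issue
    long scannerTimeout = 60_000;    // assumed scanner/RPC timeout in ms
    long txTimeout = 30_000;         // assumed transaction timeout in ms

    System.out.println("normal batch (10 ms/Get):  " + batchMillis(rows, 10) + " ms");
    System.out.println("slow batch (100 ms/Get):   " + batchMillis(rows, 100) + " ms");
    System.out.println("max ms/Get within scanner timeout: " + maxMillisPerGet(rows, scannerTimeout));
    System.out.println("max ms/Get within tx timeout:      " + maxMillisPerGet(rows, txTimeout));
  }
}
```

Under these assumptions, a per-Get latency that is only a few times higher than normal is enough to push a full batch past the transaction timeout, and a tenfold slowdown pushes it past the scanner timeout as well.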
PartitionedFileSet (or the partition consumer) should have a way to renew the scanner lease, or to close the scanner before it times out, to be resilient in such situations.
Possibly this should be an option for indexed tables, or even for scans on any Table.
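One possible shape for the "close the scanner before it times out" option is sketched below. This is not existing CDAP or HBase code: `RowScanner`, `ScannerFactory`, and `LeaseAwareScan` are hypothetical stand-ins (the real scanner would be whatever the indexed table wraps), and the in-memory factory exists only to exercise the logic. The idea is to track how long the current scanner has been open and, once a configurable budget is used up, close it and open a new one positioned after the last processed row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

public class LeaseAwareScan {

  /** Hypothetical stand-in for a server-side scanner. */
  interface RowScanner extends AutoCloseable {
    String next();            // next row key, or null when exhausted
    @Override void close();
  }

  /** Hypothetical factory: opens a scanner strictly after the given row (null = start). */
  interface ScannerFactory {
    RowScanner openAfter(String lastRow);
  }

  /**
   * Scans rows and hands each one to the processor (the per-row Get in this issue).
   * Once the current scanner has been open for maxOpenMillis or longer, it is closed
   * and a fresh one is opened after the last processed row.
   * Returns the number of times the scanner was reopened.
   */
  static int scan(ScannerFactory factory, Consumer<String> processor,
                  long maxOpenMillis, LongSupplier clock) {
    int reopens = 0;
    String lastRow = null;
    RowScanner scanner = factory.openAfter(null);
    long openedAt = clock.getAsLong();
    try {
      String row;
      while ((row = scanner.next()) != null) {
        processor.accept(row);          // may be slow when HBase is under load
        lastRow = row;
        if (clock.getAsLong() - openedAt >= maxOpenMillis) {
          scanner.close();              // give up the old lease before it expires
          scanner = factory.openAfter(lastRow);
          openedAt = clock.getAsLong();
          reopens++;
        }
      }
    } finally {
      scanner.close();
    }
    return reopens;
  }

  /** In-memory factory over a sorted list of distinct keys, for demonstration only. */
  static ScannerFactory inMemory(List<String> sortedRows) {
    return lastRow -> new RowScanner() {
      private int pos = lastRow == null ? 0 : sortedRows.indexOf(lastRow) + 1;
      @Override public String next() {
        return pos < sortedRows.size() ? sortedRows.get(pos++) : null;
      }
      @Override public void close() { }
    };
  }

  public static void main(String[] args) {
    List<String> rows = List.of("a", "b", "c", "d", "e");
    List<String> seen = new ArrayList<>();
    long[] now = {0};  // fake clock, advanced by the simulated Get
    // Each simulated Get takes 40 ms; a scanner may stay open for at most 100 ms.
    int reopens = scan(inMemory(rows), r -> { seen.add(r); now[0] += 40; }, 100, () -> now[0]);
    System.out.println(seen + " reopens=" + reopens);
  }
}
```

With the fake clock above, the run prints `[a, b, c, d, e] reopens=1`: every row is processed exactly once, and the scanner is reopened once after the third row. Renewing the lease in place instead of reopening would fit the same loop; only the body of the `if` would change. Neither variant addresses the transaction timeout, which would still need to be handled separately.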