We would like users to be able to set a schema on a Table as one of the table's properties. This is a read-time schema that will enable ad-hoc SQL queries through Hive, as well as reading and writing tables with higher-level objects than byte arrays.
There is some overlap with the schema work being done for streams, as the concept is similar.
We can do this in phases.
Phase 1 is entirely internal, where we add schema as a property of each table without letting users set it. The default schema for a table is a record of 2 fields: row as a byte array and columns as a Map<byte[], byte[]>. This will require:
1. An internal Schema object to Hive schema converter
2. StorageHandler, SerDe, InputFormat and RecordReader for reading from tables using Hive, and OutputFormat for writing into tables using Hive.
3. Creating an external Hive table for each core table dataset when the table dataset is created.
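For item 1 above, a minimal sketch of what the internal-Schema-to-Hive converter could look like. The field-name-to-type map used to represent the internal Schema here is a hypothetical stand-in; the real internal Schema object may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an internal-schema-to-Hive-schema converter.
public class HiveSchemaConverter {

    // Maps an internal type name to the corresponding Hive type.
    static String toHiveType(String internalType) {
        switch (internalType) {
            case "bytes":            return "BINARY";
            case "string":           return "STRING";
            case "int":              return "INT";
            case "long":             return "BIGINT";
            case "map<bytes,bytes>": return "MAP<BINARY,BINARY>";
            default: throw new IllegalArgumentException("unsupported type: " + internalType);
        }
    }

    // Renders a record schema (field name -> internal type) as the column
    // list of a Hive CREATE EXTERNAL TABLE statement.
    static String toHiveColumns(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(e.getKey()).append(' ').append(toHiveType(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The default Phase 1 schema: row as bytes, columns as a map.
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("row", "bytes");
        defaults.put("columns", "map<bytes,bytes>");
        System.out.println(toHiveColumns(defaults));
        // prints: row BINARY, columns MAP<BINARY,BINARY>
    }
}
```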
Phase 2 is where we expose schema for ad-hoc queries and let users set schema on a Table through a RESTful endpoint. This will require:
1. Format for writing to and reading from tables. This determines how lists, maps, and records are stored in a table. For example, one way to store a record is to serialize the entire object and store it in a single column. Another way is to store each field of a record in a separate column.
2. Some sort of object (Record? StructuredRecord?) to represent a record when we perform deserialization, and a way to take a schema, table, and format and deserialize rows into records.
3. RESTful API for setting schema on a table. Open questions:
- What happens to running queries when the schema is changed?
- Do we allow setting multiple schemas on the same table to create different views of it?
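One way to sketch item 1 (the field-per-column format) together with item 2 (deserializing rows into records). The plain maps used here for the row, schema, and record are hypothetical stand-ins for whatever Record/StructuredRecord type we settle on.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of deserializing a table row into a record under the
// "one column per field" format.
public class FieldPerColumnFormat {

    // A record is modeled as an ordered map of field name -> decoded value.
    static Map<String, Object> deserialize(Map<String, byte[]> row,
                                           Map<String, String> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : schema.entrySet()) {
            byte[] raw = row.get(field.getKey());
            record.put(field.getKey(), decode(raw, field.getValue()));
        }
        return record;
    }

    // Decodes a single column value according to its declared type.
    static Object decode(byte[] raw, String type) {
        if (raw == null) return null;
        switch (type) {
            case "string": return new String(raw, StandardCharsets.UTF_8);
            case "int":    return ByteBuffer.wrap(raw).getInt();
            default: throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("name", "string");
        schema.put("age", "int");

        Map<String, byte[]> row = new HashMap<>();
        row.put("name", "alice".getBytes(StandardCharsets.UTF_8));
        row.put("age", ByteBuffer.allocate(4).putInt(30).array());

        System.out.println(deserialize(row, schema)); // prints: {name=alice, age=30}
    }
}
```

The single-column variant would instead serialize the whole record into one column value and reverse that on read; the Format abstraction is what lets the two coexist.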
Phase 3 is where we expose schema programmatically, allowing users to set a schema on a table in their code and interact with it through higher-level abstractions. For example, instead of writing bytes, they can write an object; instead of reading bytes, they can read an object. This will likely be part of a later release.
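A rough sketch of what that programmatic abstraction could look like. TypedTable and its constructor-supplied encoder/decoder functions are hypothetical names; the real API would derive the codec from the table's schema and format rather than take it as an argument.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a typed view over a byte-oriented table: users write and
// read objects, and the table handles the byte-level encoding.
public class TypedTable<T> {
    // Stands in for the real byte-array-based Table.
    private final Map<String, byte[]> underlying = new HashMap<>();
    private final Function<T, byte[]> encoder;
    private final Function<byte[], T> decoder;

    public TypedTable(Function<T, byte[]> encoder, Function<byte[], T> decoder) {
        this.encoder = encoder;
        this.decoder = decoder;
    }

    // Instead of writing bytes, users write an object.
    public void write(String rowKey, T object) {
        underlying.put(rowKey, encoder.apply(object));
    }

    // Instead of reading bytes, users read an object.
    public T read(String rowKey) {
        byte[] raw = underlying.get(rowKey);
        return raw == null ? null : decoder.apply(raw);
    }

    public static void main(String[] args) {
        TypedTable<String> table = new TypedTable<>(
            s -> s.getBytes(StandardCharsets.UTF_8),
            b -> new String(b, StandardCharsets.UTF_8));
        table.write("greeting", "hello");
        System.out.println(table.read("greeting")); // prints: hello
    }
}
```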