We have identified that schema is something we want to be able to define for a dataset, and maybe for streams as well. This task is to define what a schema is, and get some idea of how users will interact with schemas.
1. Users should be able to create a core dataset with a schema through the RESTful APIs, similar to how they create datasets today. For example:
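A minimal sketch of what such a request could look like, assuming a hypothetical `/v2/datasets/<name>` endpoint and a `schema` property in the request body (neither the path nor the property names are finalized):

```
PUT /v2/datasets/purchases HTTP/1.1
Content-Type: application/json

{
  "typeName": "table",
  "properties": {
    "schema": "{\"type\":\"record\",\"name\":\"purchase\",\"fields\":[{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"}]}"
  }
}
```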
They should also be able to do this through the Java APIs:
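A sketch of what the corresponding Java API could look like, using a builder style; the `Schema`, `DatasetSpecification`, and `datasetFramework` names are illustrative placeholders, not an existing API:

```java
// Hypothetical builder-style API for creating a dataset with a schema.
// None of these class or method names are finalized.
Schema schema = Schema.recordOf("purchase",
    Schema.Field.of("user", Schema.of(Schema.Type.STRING)),
    Schema.Field.of("price", Schema.of(Schema.Type.DOUBLE)));

DatasetSpecification spec = DatasetSpecification.builder("purchases", "table")
    .schema(schema)
    .build();
datasetFramework.createDataset(spec);
```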
Schema is not enforced on writes by default, but it can be configured to be.
2. Users must be able to change schemas attached to data. For example, it should be possible to add a field to a schema. Changing a schema will not change the existing underlying data.
Q: should the platform restrict what types of schema changes are allowed? For example, should it prevent a user from changing a float field into an integer field?
3. Schema is a property of the data, describing its structure and content. As such, the platform should provide some additional functionality once a schema has been defined for some data:
- It should be possible to run ad-hoc SQL queries on the data.
- It should be possible to add indexes on one or more fields of the data.
- From the Java code, it should be possible to work with objects instead of byte arrays.
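As an illustration of the last point, a hypothetical typed read, assuming the dataset can use its schema to decode rows into objects (the method names and the `Purchase` class are placeholders):

```java
// Without a schema: callers decode raw bytes themselves.
byte[] raw = table.get(Bytes.toBytes("purchase-123"));

// With a schema: the dataset can decode the row into a typed object.
Purchase p = table.read("purchase-123", Purchase.class);
System.out.println(p.getUser() + " paid " + p.getPrice());
```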
Defining a Schema:
A schema assumes your data is broken up into a collection of records, and defines which fields can be present in a record along with the type of each field. We must define which field types the platform supports, as well as whether we want to support additional rules for fields, such as required fields or default values. We must also define how a schema is represented. One proposal is to start by exposing our internal Schema representation, which is the Avro schema with added support for map keys that are not strings:
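For example, a record schema in this Avro-based JSON syntax might look like the following. The `keys` attribute on the map type is the extension over standard Avro, where map keys are always strings (so standard Avro maps declare only `values`); the attribute name here is a sketch, not a finalized syntax:

```json
{
  "type": "record",
  "name": "purchase",
  "fields": [
    { "name": "user",  "type": "string" },
    { "name": "price", "type": "double" },
    { "name": "itemCounts",
      "type": { "type": "map", "keys": "int", "values": "long" } }
  ]
}
```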
We can add support for other schema syntaxes, such as the Hive schema syntax, but we will start with this Avro superset. The use of schema in ad-hoc queries assumes there is a way, given just the schema, to decode the underlying data representation into fields that can be queried. This means that our core datasets will be in control of how data is written, in order to ensure that they can read those fields back out. Since our schema supports types that Hive does not, certain fields in a schema may not be exposed for ad-hoc queries.