Parquet Parser
The Parquet parser is currently experimental and may change in future versions.
The Parquet parser is used for parsing incoming Parquet data. To use it, set
decoder to parquet. To map a field to a different name, put the Parquet field name in the
source of the corresponding schema property.
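As a minimal sketch of the two settings described above, the configuration below selects the parser and renames one field. The names `id` and `user_id` are hypothetical, and the layout assumes the same schema structure as the example at the end of this page.

```yaml
decoder: parquet
schema:
  type: object
  properties:
    # Hypothetical field: the Parquet column user_id is exposed as id
    id:
      type: integer
      format: int64
      source: user_id
```

A property with no source is assumed to read the Parquet field that has the same name as the property.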
Types
For each field, specify the schema type and format that correspond to its Parquet type. See
Parsing Data for more information about formats.
Primitive Types
| Parquet Type | Schema Type | Schema Format |
|---|---|---|
| boolean | boolean | |
| int32 | integer | int32 |
| int64 | integer | int64 |
| int96 | n/a | |
| float | number | float32 |
| double | number | float64 |
| binary | string | bytes |
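To illustrate how the rows above might translate into a schema, here is a sketch with one property per primitive type. All field names are hypothetical; only the type and format values come from the table.

```yaml
schema:
  type: object
  properties:
    active:          # Parquet boolean
      type: boolean
    count:           # Parquet int64
      type: integer
      format: int64
    ratio:           # Parquet double
      type: number
      format: float64
    payload:         # Parquet binary with no logical type
      type: string
      format: bytes
```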
Logical Types
Strings
| ConvertedType | LogicalType | Schema Type | Schema Format |
|---|---|---|---|
| UTF8 | STRING | string | |
| ENUM | ENUM | string | |
| | UUID | string | uuid |
Numeric
| ConvertedType | LogicalType | Schema Type | Schema Format |
|---|---|---|---|
| INT_8 | INT(8,true) | integer | int8 |
| INT_16 | INT(16,true) | integer | int16 |
| INT_32 | INT(32,true) | integer | int32 |
| INT_64 | INT(64,true) | integer | int64 |
| UINT_8 | INT(8,false) | integer | uint8 |
| UINT_16 | INT(16,false) | integer | uint16 |
| UINT_32 | INT(32,false) | integer | uint32 |
| UINT_64 | INT(64,false) | integer | uint64 |
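As a sketch of the numeric mappings above, the schema below reads one signed and one unsigned column. The field names `level` and `port` are hypothetical; the comments name the Parquet logical type each property is assumed to carry.

```yaml
schema:
  type: object
  properties:
    level:           # Parquet INT(8,true)
      type: integer
      format: int8
    port:            # Parquet INT(16,false)
      type: integer
      format: uint16
```

Using the unsigned format for an unsigned column matters for large values: an INT(32,false) value above 2147483647 does not fit in int32.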
Temporal
Convert to date/time
| ConvertedType | LogicalType | Schema Type | Schema Format |
|---|---|---|---|
| TIME_MILLIS | TIMESTAMP(isAdjustedToUTC=true,unit=MILLIS) | string | date-time |
| TIME_MICROS | TIMESTAMP(isAdjustedToUTC=true,unit=MICROS) | string | date-time |
| | TIMESTAMP(isAdjustedToUTC=true,unit=NANOS) | string | date-time |
Convert to integer
| ConvertedType | LogicalType | Schema Type | Schema Format |
|---|---|---|---|
| TIME_MILLIS | TIMESTAMP(isAdjustedToUTC=true,unit=MILLIS) | integer | int32 |
| TIME_MICROS | TIMESTAMP(isAdjustedToUTC=true,unit=MICROS) | integer | int64 |
| | TIMESTAMP(isAdjustedToUTC=true,unit=NANOS) | integer | int64 |
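The two tables above describe two ways to read the same kind of column. The sketch below assumes two hypothetical columns, `created_at` and `updated_at`, both annotated as TIMESTAMP(isAdjustedToUTC=true,unit=MICROS), and reads the first as a date/time and the second as a raw integer.

```yaml
schema:
  type: object
  properties:
    created_at:      # converted to a date/time value
      type: string
      format: date-time
    updated_at:      # kept as the stored integer (microseconds)
      type: integer
      format: int64
```

The integer form keeps the stored value unchanged, which can be useful when downstream code does its own time arithmetic.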
Repeated Types
To read a repeated type, use the schema type array and describe the Parquet type of the repeated element in items.
Example
Parquet definition:
message record {
  required binary name (STRING);
  required int32 age (INT(32,false));
  repeated group friends {
    required binary friend_name (STRING);
  }
}
Decoder configuration:
decoder: parquet
schema:
  type: object
  properties:
    name:
      type: string
    age:
      type: integer
      format: uint32
    friends:
      type: array
      items:
        type: object
        properties:
          name:
            type: string
            source: friend_name
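The example above repeats a group. As a further sketch, and assuming the parser handles a repeated primitive in the same way, a hypothetical column declared as `repeated int32 scores` could be read by placing the primitive type directly in items:

```yaml
decoder: parquet
schema:
  type: object
  properties:
    scores:            # hypothetical repeated int32 column
      type: array
      items:
        type: integer
        format: int32
```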