Parsing Data
Incoming data is parsed by setting a decoder and defining a schema to describe the data structure.
Decoders
Decoders are used to parse arbitrary data from various file formats.
Database engines do not require a decoder, but they may use a schema to describe the data
structure.
Decoder | Description |
---|---|
CSV | Parse CSV data. |
HTML | Parse HTML documents using XPath. |
JSON | Parse JSON data. |
MessagePack | Parse MessagePack data. |
Parquet | Parse Parquet data. |
Protocol Buffers | Parse Protocol Buffers data. |
Regular Expressions | Parse text using regular expressions. |
XML | Parse XML documents using XPath. |
Schema Basics
The schema format used to describe structured data is based on a subset of JSON Schema. The supported types are described below.
String
The string type is used for text.
Syntax
schema:
  type: string
  format: <format> # optional
Formats
If a string does not match the specified format, the entire record will be skipped.
Optimus Mine supports the following formats in the string schema.
Format | Description |
---|---|
date-time | Date and time as specified in RFC 3339, for example, 2018-11-13T20:20:39+07:00. |
date | Date, for example, 2018-11-13. |
time | Time, for example, 20:20:39. |
Example
schema:
  type: string
  format: date-time
Integer
The integer type is used for whole numbers (i.e. no fractional component) and can be positive, negative or
zero.
Fractional components of numbers are dropped, e.g. 1.9 becomes 1.
Syntax
schema:
  type: integer
  format: <format> # optional
Formats
Setting format on integer schemas changes the underlying type. Only set this if you know what you’re doing.
Optimus Mine supports the following formats in the integer schema.
Format | Description |
---|---|
int32 | Signed 32-bit integer. Range: -2,147,483,648 through 2,147,483,647. Default on 32-bit machines. |
int64 | Signed 64-bit integer. Range: -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807. Default on 64-bit machines. |
uint32 | Unsigned 32-bit integer. Range: 0 through 4,294,967,295. |
uint64 | Unsigned 64-bit integer. Range: 0 through 18,446,744,073,709,551,615. |
Example
schema:
  type: integer
  format: uint64
Number
The number type is used for floating-point numbers (i.e. numbers with a fractional component).
Syntax
schema:
  type: number
  format: <format> # optional
Formats
Setting format on number schemas changes the underlying type. Only set this if you know what you’re doing.
Optimus Mine supports the following formats in the number schema.
Format | Description |
---|---|
float32 | IEEE-754 32-bit floating-point number. Default on 32-bit machines. |
float64 | IEEE-754 64-bit floating-point number. Default on 64-bit machines. |
Example
schema:
  type: number
  format: float64
Boolean
The boolean type can be either true or false. Booleans can be parsed from numbers and strings.
Values that are parsed as true: 1, t, T, true, TRUE, True.
Values that are parsed as false: 0, f, F, false, FALSE, False.
Syntax
schema:
  type: boolean
Array
The array type is used for lists of a single type. For example, you can have an array of
booleans, strings or numbers. It is not possible, however, to have an array that contains multiple
types.
If the root schema is an array, each element of the array is treated as a separate record. To
keep an array within a single record, wrap it in an object type.
Syntax
schema:
  type: array
  items: <schema-definition>
Example
schema:
  type: array
  items:
    type: string
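To keep a whole array inside one record, as described above, the array can be declared as a property of an object. The following is a minimal sketch; the property name `tags` is hypothetical and only serves as an illustration.

```yaml
schema:
  type: object
  properties:
    # "tags" is a hypothetical property name used for illustration.
    tags:
      type: array
      items:
        type: string
```

Because the root schema here is an object rather than an array, the list of strings stays together in a single record instead of being split into one record per element.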
Object
The object type defines a key-value map. The keys must be strings.
Syntax
schema:
  type: object
  properties:
    <property-name>: <schema-definition>
Example
schema:
  type: object
  properties:
    name:
      type: string
    age:
      type: integer
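Since both `items` and `properties` take a schema definition, the types described on this page can presumably be nested. The sketch below combines them under that assumption; every property name in it is hypothetical.

```yaml
schema:
  type: array              # root array: each element is a separate record
  items:
    type: object
    properties:
      # All property names below are hypothetical.
      name:
        type: string
      signed_up:
        type: string
        format: date-time  # RFC 3339; a non-matching value skips the record
      visits:
        type: integer
        format: uint32
      active:
        type: boolean      # accepts e.g. 1, t, true, 0, f, false
      scores:
        type: array
        items:
          type: number
```

Here each element of the root array becomes its own record, while the nested `scores` array stays inside that record because it is wrapped in the object.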