Parsing Data

Incoming data is parsed by setting a decoder and defining a schema to describe the data structure.

Decoders

Decoders are used to parse arbitrary data from various file formats.
Database engines do not require a decoder, but they may still use a schema to describe the data structure.

Decoder              Description
CSV                  Parse CSV data.
HTML                 Parse HTML documents using XPath.
JSON                 Parse JSON data.
MessagePack          Parse MsgPack data.
Parquet              Parse Parquet data.
Protocol Buffers     Parse Protocol Buffers data.
Regular Expressions  Parse text using regular expressions.
XML                  Parse XML documents using XPath.

Schema Basics

The schema format used to describe structured data is based on a subset of JSON Schema. The supported types are described below.

String

The string type is used for text.

Syntax

schema:
  type: string
  format: <format> # optional

Formats

If a string does not match the specified format, the entire record will be skipped.

Optimus Mine supports the following formats in the string schema.

Format     Description
date-time  Date and time as specified in RFC 3339, for example, 2018-11-13T20:20:39+07:00.
date       Date, for example, 2018-11-13.
time       Time, for example, 20:20:39.

Example

schema:
  type: string
  format: date-time

Integer

The integer type is used for whole numbers (i.e. no fractions) and can be positive, negative or zero.

The fractional component of a number is dropped rather than rounded; for example, 1.9 becomes 1.

Syntax

schema:
  type: integer
  format: <format> # optional

Formats

Setting format on an integer schema changes the underlying type. Only set it when you need a specific size or signedness, since values outside the chosen range cannot be represented.

Optimus Mine supports the following formats in the integer schema.

Format  Description
int32   Signed 32-bit integer. Range: -2,147,483,648 through 2,147,483,647. Default on 32-bit machines.
int64   Signed 64-bit integer. Range: -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807. Default on 64-bit machines.
uint32  Unsigned 32-bit integer. Range: 0 through 4,294,967,295.
uint64  Unsigned 64-bit integer. Range: 0 through 18,446,744,073,709,551,615.

Example

schema:
  type: integer
  format: uint64

Number

The number type is used for floating-point numbers (i.e. numbers with a fractional component).

Syntax

schema:
  type: number
  format: <format> # optional

Formats

Setting format on a number schema changes the underlying type. Only set it when you need a specific floating-point precision.

Optimus Mine supports the following formats in the number schema.

Format   Description
float32  IEEE-754 32-bit floating-point number. Default on 32-bit machines.
float64  IEEE-754 64-bit floating-point number. Default on 64-bit machines.

Example

schema:
  type: number
  format: float64

Boolean

The boolean type can be either true or false. Booleans can be parsed from numbers and strings.

Values that are parsed as true: 1, t, T, true, TRUE, True.
Values that are parsed as false: 0, f, F, false, FALSE, False.

Syntax

schema:
  type: boolean
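
As an illustration of how the accepted values apply in practice, the sketch below declares two boolean properties inside an object. The property names in_stock and discontinued are hypothetical and only serve the example; the comments restate the accepted values listed above.

```yaml
# Hypothetical record schema; the property names are illustrative only.
schema:
  type: object
  properties:
    in_stock:
      type: boolean   # inputs such as 1, t or TRUE are parsed as true
    discontinued:
      type: boolean   # inputs such as 0, f or False are parsed as false
```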

Array

The array type is used for lists whose elements all share a single type, for example an array of booleans, strings or numbers. An array cannot contain elements of different types.

If the root schema is an array, each element of the array is treated as a separate record. To keep an array within a single record, wrap it in an object type.

Syntax

schema:
  type: array
  items: <schema-definition>

Example

schema:
  type: array
  items:
    type: string
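
To keep a list inside a single record instead of splitting it into one record per element, the array can be wrapped in an object, as described above. The following is a minimal sketch that combines the array and object syntax from this page; the property name tags is hypothetical.

```yaml
# The root schema is an object, so the whole list stays in one record.
schema:
  type: object
  properties:
    tags:             # hypothetical property name
      type: array
      items:
        type: string
```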

Object

The object type defines a key-value map. The keys must be strings.

Syntax

schema:
  type: object
  properties:
    <property-name>: <schema-definition>

Example

schema:
  type: object
  properties:
    name:
      type: string
    age:
      type: integer
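
The types on this page can be combined. The sketch below assumes that schema definitions nest wherever the syntax shows a <schema-definition> placeholder, and uses hypothetical property names (signed_up, scores). Because the root schema is an array, each object in it would be treated as a separate record.

```yaml
# Root array: every element is parsed as its own record.
schema:
  type: array
  items:
    type: object
    properties:
      name:
        type: string
      signed_up:
        type: string
        format: date-time   # e.g. 2018-11-13T20:20:39+07:00
      age:
        type: integer
      scores:               # nested array kept inside each record
        type: array
        items:
          type: number
```

Note that, per the string format rules, a record whose signed_up value does not match date-time would be skipped in its entirety.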