HTML Parser

The HTML parser parses incoming HTML data using XPath. To use it, set decoder to html and give each source in your schema a valid XPath expression.

Any path starting with a dot (.) is treated as a relative path and uses the parent schema’s source as its root.
If the parent schema has the array type, the path is resolved against each element of that array.
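
The relative-path rule can be illustrated outside the decoder with a minimal Python sketch. It is not the parser's implementation: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so the sample has no doctype and the array source is written as .//body/article rather than //body/article. The helper name resolve_records and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (no doctype, because ElementTree is an XML parser).
PAGE = """
<html>
  <body>
    <article><h2>First Article</h2><div><p>Some cool story</p></div></article>
    <article><h2>Second Article</h2><div><p>Another cool story</p></div></article>
  </body>
</html>
"""


def resolve_records(markup, array_source, item_sources):
    """Hypothetical helper: match array_source once, then evaluate every
    relative path against each matched element in turn."""
    root = ET.fromstring(markup)
    records = []
    for element in root.findall(array_source):
        record = {}
        for name, path in item_sources.items():
            # A path starting with "." is resolved against this array element.
            node = element.find(path)
            record[name] = node.text if node is not None else None
        records.append(record)
    return records


records = resolve_records(
    PAGE,
    ".//body/article",
    {"title": "./h2", "text": "./div/p"},
)
for record in records:
    print(record)
```

Each article element becomes the root for ./h2 and ./div/p, so the same relative paths produce one record per array element.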

Examples

Simple Example

Incoming data:

<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>Greetings traveller</p>
  </body>
</html>

Decoder configuration:

decoder: html
schema:
  type: object
  properties:
    title:
      type: string
      source: //h1
    text:
      type: string
      source: /html/body/p

Output:

title: Hello World
text: Greetings traveller

Creating multiple records from a single page

Incoming data:

<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Blog</h1>
    <article>
      <h2>First Article</h2>
      <div>
        <p>Some cool story</p>
      </div>
    </article>
    <article>
      <h2>Second Article</h2>
      <div>
        <p>Another cool story</p>
      </div>
    </article>
  </body>
</html>

Decoder configuration:

decoder: html
schema:
  type: array # the root schema type is array, so each element is output as a separate record
  source: //body/article
  items:
    type: object
    properties:
      title:
        type: string
        source: ./h2
      text:
        type: string
        source: ./div/p

Output:

- title: First Article
  text: Some cool story

- title: Second Article
  text: Another cool story