XPath Parser

The XPath parser extracts data from incoming HTML using XPath expressions. To use it, set engine to xpath and give each source in your schema a valid XPath expression.

Any path starting with a dot (.) is treated as a relative path and uses the node matched by the parent schema’s source as its root.
If the parent schema has the array type, the relative path is resolved against an element of that array, once per element.
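The relative-path rule can be illustrated outside Optimus with a small Python sketch. This is not how Optimus is implemented; it only mimics the described behaviour using the standard library's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed markup. The PAGE constant and the extract_records function are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a blog page.
# ElementTree parses XML, so the doctype and any unclosed tags are left out.
PAGE = """<html>
  <body>
    <h1>Blog</h1>
    <article><h2>First Article</h2><div><p>Some cool story</p></div></article>
    <article><h2>Second Article</h2><div><p>Another cool story</p></div></article>
  </body>
</html>"""


def extract_records(page: str) -> list[dict]:
    """Mimic an array schema whose items use relative sources."""
    root = ET.fromstring(page)
    records = []
    # Parent source: every <article> under <body> becomes one array element.
    for article in root.findall(".//body/article"):
        # Child sources start with a dot, so they are evaluated
        # relative to the current <article>, not the whole document.
        records.append({
            "title": article.findtext("./h2"),
            "text": article.findtext("./div/p"),
        })
    return records


if __name__ == "__main__":
    for record in extract_records(PAGE):
        print(record)
```

Each article yields its own record because ./h2 and ./div/p are looked up inside the matched article rather than from the document root.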

Examples

Simple Example

Content of https://example.com/:

<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>Greetings traveller</p>
  </body>
</html>

Optimus task config:

tasks:
  - task: scrape
    input: https://example.com/
    engine: xpath
    schema:
      type: object
      properties:
        title:
          type: string
          source: //h1
        text:
          type: string
          source: /html/body/p

Output:

title: Hello World
text: Greetings traveller

Creating multiple records from a single page

Content of https://example.com/blog.html:

<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Blog</h1>
    <article>
      <h2>First Article</h2>
      <div>
        <p>Some cool story</p>
      </div>
    </article>
    <article>
      <h2>Second Article</h2>
      <div>
        <p>Another cool story</p>
      </div>
    </article>
  </body>
</html>

Optimus task config:

tasks:
  - task: scrape
    input: https://example.com/blog.html
    engine: xpath
    schema:
      type: array # the root schema type is array, so each element is output as a separate record
      source: //body/article
      items:
        type: object
        properties:
          title:
            type: string
            source: ./h2
          text:
            type: string
            source: ./div/p

Output:

- title: First Article
  text: Some cool story

- title: Second Article
  text: Another cool story