XPath Parser
The XPath parser extracts data from incoming HTML using XPath. To use it, set engine to xpath and use a valid XPath path in your schema's source.
Any path starting with a dot (.) is treated as a relative path and uses the parent schema's source as the root.
If the parent schema has the array type, the relative path resolves against each element of that array.
Examples
Simple Example
Content of https://example.com/:
<!doctype html>
<html lang="en">
<head>
<title>Document</title>
</head>
<body>
<h1>Hello World</h1>
<p>Greetings traveller</p>
</body>
</html>
Optimus task config:
tasks:
  - task: scrape
    input: https://example.com/
    engine: xpath
    schema:
      type: object
      properties:
        title:
          type: string
          source: //h1
        text:
          type: string
          source: /html/body/p
Output:
title: Hello World
text: Greetings traveller
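To see what the two absolute paths select, the lookup can be reproduced outside Optimus. The sketch below is only an illustration and says nothing about how Optimus evaluates paths internally. It uses Python's standard-library xml.etree.ElementTree, which supports a subset of XPath, so //h1 is written as .//h1 and /html/body/p as ./body/p (relative to the html root element). The doctype line is omitted because ElementTree expects well-formed XML. The name scrape_simple is hypothetical.

```python
import xml.etree.ElementTree as ET

# Same markup as the example page, minus the doctype (assumption: well-formed XML).
SIMPLE_PAGE = """<html lang="en">
<head><title>Document</title></head>
<body>
<h1>Hello World</h1>
<p>Greetings traveller</p>
</body>
</html>"""


def scrape_simple(markup):
    """Hypothetical helper mirroring the object schema above."""
    root = ET.fromstring(markup)  # root is the <html> element
    return {
        # ".//h1" stands in for "//h1": any h1 below the root
        "title": root.findtext(".//h1"),
        # "./body/p" stands in for "/html/body/p": root is already <html>
        "text": root.findtext("./body/p"),
    }


simple_record = scrape_simple(SIMPLE_PAGE)
print(simple_record)
```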
Creating multiple records from a single page
Content of https://example.com/blog.html:
<!doctype html>
<html lang="en">
<head>
<title>Document</title>
</head>
<body>
<h1>Blog</h1>
<article>
<h2>First Article</h2>
<div>
<p>Some cool story</p>
</div>
</article>
<article>
<h2>Second Article</h2>
<div>
<p>Another cool story</p>
</div>
</article>
</body>
</html>
Optimus task config:
tasks:
  - task: scrape
    input: https://example.com/blog.html
    engine: xpath
    schema:
      type: array # the root schema type is array, so each element is output as a separate record
      source: //body/article
      items:
        type: object
        properties:
          title:
            type: string
            source: ./h2
          text:
            type: string
            source: ./div/p
Output:
- title: First Article
  text: Some cool story
- title: Second Article
  text: Another cool story
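The relative-path behaviour in this example can be sketched in plain Python: the array's source selects the article elements, and each ./ path is then evaluated with one article as its root. This is a rough illustration under stated assumptions, not Optimus's implementation. It relies on the standard-library xml.etree.ElementTree (a subset of XPath, so //body/article becomes .//body/article), it drops the doctype so the markup parses as XML, and scrape_articles is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Same markup as blog.html, minus the doctype (assumption: well-formed XML).
BLOG_PAGE = """<html lang="en">
<head><title>Document</title></head>
<body>
<h1>Blog</h1>
<article><h2>First Article</h2><div><p>Some cool story</p></div></article>
<article><h2>Second Article</h2><div><p>Another cool story</p></div></article>
</body>
</html>"""


def scrape_articles(markup):
    """Hypothetical helper mirroring the array schema above."""
    root = ET.fromstring(markup)
    records = []
    # The array source picks the elements; one record is built per element.
    for article in root.findall(".//body/article"):
        records.append({
            # Relative paths are resolved with the current article as root.
            "title": article.findtext("./h2"),
            "text": article.findtext("./div/p"),
        })
    return records


blog_records = scrape_articles(BLOG_PAGE)
for record in blog_records:
    print(record)
```

Because every relative lookup starts from a single article, the second record can never pick up the first article's heading or paragraph.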