HTML Parser
The HTML parser parses incoming HTML data using XPath. To use it, set decoder to html and use a valid XPath expression in your schema’s source.
Any path starting with a dot (.) is treated as a relative path and uses the parent schema’s source as the root. If the parent schema has the array type, the relative path is resolved against each element of that array.
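The relative-path rule can be illustrated outside the decoder. The following is a minimal sketch in Python using only the standard library, not the decoder's actual implementation: the function name extract_records, its parameters, and the sample markup are hypothetical. Because xml.etree.ElementTree supports only a subset of XPath and requires well-formed markup, the array path is written as .//body/article rather than //body/article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input (ElementTree cannot parse arbitrary HTML).
PAGE = """
<html>
  <body>
    <article><h2>First</h2><div><p>One</p></div></article>
    <article><h2>Second</h2><div><p>Two</p></div></article>
  </body>
</html>
"""


def extract_records(markup, array_source, item_sources):
    """Build one record per element matched by array_source.

    Each path in item_sources starts with a dot, so it is evaluated
    relative to the current array element, not the document root.
    """
    root = ET.fromstring(markup)
    records = []
    for element in root.findall(array_source):
        record = {}
        for name, path in item_sources.items():
            node = element.find(path)  # relative lookup inside this element
            record[name] = node.text if node is not None else None
        records.append(record)
    return records


records = extract_records(
    PAGE,
    ".//body/article",
    {"title": "./h2", "text": "./div/p"},
)
print(records)
```

Each relative path is looked up once per matched article, which is why the same ./h2 path yields a different title for every record.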
Examples
Simple Example
Incoming data:
<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>Greetings traveller</p>
  </body>
</html>
Decoder configuration:
decoder: html
schema:
  type: object
  properties:
    title:
      type: string
      source: //h1
    text:
      type: string
      source: /html/body/p
Output:
title: Hello World
text: Greetings traveller
Creating multiple records from a single page
Incoming data:
<!doctype html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Blog</h1>
    <article>
      <h2>First Article</h2>
      <div>
        <p>Some cool story</p>
      </div>
    </article>
    <article>
      <h2>Second Article</h2>
      <div>
        <p>Another cool story</p>
      </div>
    </article>
  </body>
</html>
Decoder configuration:
decoder: html
schema:
  type: array # the root schema type is array, so each element is output as a separate record
  source: //body/article
  items:
    type: object
    properties:
      title:
        type: string
        source: ./h2
      text:
        type: string
        source: ./div/p
Output:
- title: First Article
  text: Some cool story
- title: Second Article
  text: Another cool story