PulsarRPA has developed X-SQL to directly query the internet and convert web pages into tables and charts. X-SQL extends SQL to manage web data, including web crawling, data collection, data extraction, web content mining, web BI, and more.
When PulsarRPA runs as a REST service, X-SQL can be used to collect web pages or directly query the internet anytime, anywhere, without opening an IDE, just like an upgraded version of Google and Baidu.
Now, in large-scale data collection projects, all extraction rules (Chinese mirror) are written in X-SQL, with data type conversion and data cleaning also handled by the powerful X-SQL inline processing. The experience of writing X-SQL for data collection projects is as simple and efficient as traditional CRUD projects. A good example is x-asin.sql (Chinese mirror), which extracts more than 70 fields from each product page.
Exotic Amazon (Chinese mirror) is a complete solution for collecting the entire amazon.com website, ready to use out of the box, covering most data types on Amazon, and it will be permanently provided for free and open source. Thanks to the comprehensive web data management infrastructure provided by PulsarRPA, the entire solution consists of no more than 3500 lines of Kotlin code and less than 700 lines of X-SQL to extract over 650 fields.
This course introduces the basic concepts, basic usage, and the most common SQL functions of X-SQL.
A concise example is:
fun main() {
val context = SQLContexts.create()
val sql = """
select
dom_first_text(dom, '#productTitle') as title,
dom_first_text(dom, '#bylineInfo') as brand,
dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
dom_first_text(dom, '#acrCustomerReviewText') as ratings,
str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B0C1H26C46 -i 1s -njr 3', 'body');
"""
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))
}
Complete code: kotlin, Chinese mirror.
X-SQL is based on the H2 database, and PulsarRPA mainly extends the H2 database by registering UDFs, but it also does other extensions.
Each X-SQL function has a namespace, for example:
dom_base_uri() -> dom
str_substring_after() -> str
In the above example, dom and str are namespaces. If an X-SQL function is declared as hasShortcut, the namespace can be ignored.
X-SQL functions are not case-sensitive, and all underscores ( _ ) are ignored.
The following X-SQL functions are the same:
DOM_LOAD_AND_SELECT(url, 'body');
dom_loadAndSelect(url, 'body');
Dom_Load_And_Select(url, 'body');
dom_load_and_select(url, 'body');
dOm_____lo_AdaNd_S___elEct_____(url, 'body');
Since LOAD_AND_SELECT is declared as hasShortcut, the namespace can be ignored, and the following functions are still the same:
LOAD_AND_SELECT(url, 'body');
loadAndSelect(url, 'body');
Load_And_Select(url, 'body');
load_and_select(url, 'body');
_____lo_AdaNd_S___elEct_____(url, 'body');
Each table function returns a ResultSet and can appear in the from clause.
LOAD_AND_SELECT(url [cssSelector [, offset [, limit]]])
Load a page and select elements, returning a ResultSet. The resulting ResultSet has two columns: DOM and DOC, both of which are of type ValueDom.
Example:
select
dom_base_uri(dom)
from
load_and_select('https://www.amazon.com/dp/B0C1H26C46', 'body', 1, 10)
DOM functions are designed to query DOM attributes. Each DOM function accepts a ValueDom argument, which is a wrapper for a Jsoup Element. DOM functions are defined in the following file: DomFunctions, and all DOM functions are in the namespace DOM.
DOM_BASE_URI(dom)
Returns the URI of the HTML document.
Example:
select dom_base_uri(dom) from load_and_select('https://www.amazon.com/dp/B0C1H26C46', 'body')
DOM selection functions are designed to query elements and their attributes from the DOM. Each DOM function accepts a parameter named DOM (case-insensitive), of type ValueDom, which is a wrapper for a Jsoup Element. DOM selection functions usually also accept a cssSelector parameter to select a child element of the DOM. The most important DOM selection functions are defined in the following file: DomSelectFunctions. All DOM selection functions are in the namespace DOM.
DOM_FIRST_TEXT(dom, cssSelector)
Returns the text content of the first element selected by cssSelector within dom, similar to the following JavaScript code.
dom.querySelector(cssSelector).textContent
Example:
select
dom_first_text(dom, '#productName') as Name,
dom_first_text(dom, '#price') as Price,
dom_first_text(dom, '#star') as StarNum
from
load_and_select('https://www.example.com/zgbs/appliances', 'ul.item-collection li.item')
DOM_ALL_TEXTS(dom, cssSelector)
Returns an array of text content from all elements selected by cssSelector within dom, similar to the following JavaScript pseudocode.
dom.querySelectorAll(cssSelector).map(e => e.textContent)
Example:
select
dom_all_texts(dom, 'ul li.item a.name') as ProductNames,
dom_all_texts(dom, 'ul li.item span.price') as ProductPrices,
dom_all_texts(dom, 'ul li.item span.star') as ProductStars
from
load_and_select('https://www.example.com/zgbs/appliances', 'div.products')
Most string functions are automatically converted from org.apache.commons.lang3.StringUtils through programming. You can find the UDF definitions in the following file: StringFunctions (Chinese mirror), and all string functions are in the namespace STR.
STR_SUBSTRING_AFTER(str, separator)
Get the substring after the first occurrence of the separator.
Example:
select
str_substring_after(dom_first_text(dom, '#price'), '$') as Price
from
load_and_select('https://www.amazon.com/dp/B0C1H26C46', 'body');