I demonstrate how to upload and extract information from an XML document that was exported from a Google Books bookshelf. Some data qaulity issues were detected with this bookshelf export. The final data model is a minor change from the model provided by Google Books: individual contributors are split out into a separate table; and the ISBN is used to identify books (where available).
Analysis of Exported Bookshelf
Google Books allows a Google user to create a bookshelf of various books discovered through Google Books. The contents of a bookshelf can be exported as an XML document. There is no XML schema published for this document.
My analysis of the XML document is here. The simplified structure is:
- library
- list_title
- blurb
- url
- books
- book
- id
- url
- title
- contributor
- identifier
- labels
The logical model for this Google Books bookshelf is:
Load Bookshelf Into XE Database
I created an Ansible workflow, called setup_demo_env.yml, to create the demonstration schema in Oracle XE database.
The loading of the XML export file is achieved through the use of an external table that uses SQL*Loader.
Extract Repeating Groups
There are two (2) repeating groups in each book:
Labels which is an XML sequence of label tags.Contributor which a comma-separated list of authors.Extract Labels from Repeating Group
The labels are extracted using a XQUERY FOR loop. See Extract Labels for a detailed description.
The SQL used was:
SELECT l.*FROM database_bookshelf_ext e, XMLTABLE ( 'for $i in /books/book/labels/label return <row> { $i/../../id, $i } </row>' PASSING xmltype(e.object_value) COLUMNS "ID" VARCHAR2(16) PATH 'id', label VARCHAR2(16) PATH 'label' ) l;This SQL takes each row from the external table (DATABASE_BOOKSHELF_EXT) and converts the CLOB in the OBJECT_VALUE into XML using the XMLTABLE function. For each occurrence on the LABEL tag that is found by the XPATH /books/book/labels/label, I get the contents of the LABEL tag (which is pointed to by the $i loop variable), then I go up two (2) levels (to XPATH /books/book) and get the contents of the ID tag. These two (2) values are combined into a row for the XMLTABLE function.
The first few rows returned are:
ID LABEL---------------- --------------------------------45fR_OeonfMC Oracle45fR_OeonfMC DatabasesuhSAAAAMAAJ DatabasesuhSAAAAMAAJ SecurityOhNBJ8_wMLwC OracleOhNBJ8_wMLwC DatabaseOhNBJ8_wMLwC Security2egNzTk871wC Database2egNzTk871wC Relational Theory2egNzTk871wC Logic2ImPFP6Yk64C DatabaseExtract Contributor from Repeating Group
The XML document has a single CONTRIBUTOR tag for each book. For books with multiple authors, the contents of this tag is a comma-separated list. Apparently, there is no XQUERY function to values from a comma-separated list that has no commas as in the case of a single author. (It seems that XQUERY rejects the Indian invention of zero – even arrays start at one (1).)
Individual authors are extracted using a pipelined function which is implemented as the GET_CONTRIBUTOR Package and GET_CONTRIBUTOR Package Body. (See Extract Contributors.)
The heart of the pipelined function is:
LOOP FETCH p_csr INTO in_rec; -- input row EXIT WHEN p_csr%notfound; out_rec.id := in_rec.id; l_num_contributor := regexp_count(in_rec.contributor_list, '[^,]+'); FOR l_idx_contributor IN 1..l_num_contributor LOOP out_rec.contributor := ltrim(regexp_substr(in_rec.contributor_list, '[^,]+', 1, l_idx_contributor)); PIPE ROW ( out_rec ); END LOOP;END LOOP;For each CONTRIBUTOR tag encountered, I use the REGEXP_SUBSTR to extract the nth occurrence of the required regular expression (which is a string of at least one (1) character that is not a comma, i.e. '[^,]+'). The REGEXP_COUNT gives the number of such substrings, and is used to drive the FOR loop.
For each substring that I find, I create a row and pass that to the calling procedure or SQL statement using the PIPE ROW statement.
This is invoked as follows:
SELECT *FROM TABLE ( contributor_pkg.get_contributor( CURSOR( SELECT c.* FROM database_bookshelf_ext e, XMLTABLE( '/books/book' PASSING xmltype(e.object_value) COLUMNS "ID" VARCHAR2(32) PATH 'id', contributors VARCHAR2(1024) PATH 'contributor' ) c ) ) );The innermost SELECT statement extracts a pair of ID and CONTRIBUTOR tag contents from each book (using the XPATH /books/book). The CURSOR funtctions allows the CONTRIBUTOR_PKG.GET_CONTRIBUTOR function to access this result set. The TABLE function converts the output from this function back into a SQL table. There is a lot of pluming involved!
The first few results are:
ID CONTRIBUTORG9AJA91PL54C Jonathan Lewis45fR_OeonfMC Thomas Kyte1Aqw6mrxz5AC Kerry Osborne1Aqw6mrxz5AC Randy Johnson1Aqw6mrxz5AC Tanel PõderlA3QygAACAAJ Eric RedmondFinal Relational Model
I created materialised views to snapshot the data uploaded from the XML document. Querying the external table causes the SQL*Loader to read the XML document. The materialised views created were:
BOOKSBOOK_LABELSBOOK_CONTRIBUTORSWith the completion of materialised views, I composed a new logical database model for the DB_BOOKS schema:
There are three (3) entities:
BOOKS with attributes:ID is primary keyTITLE is mandatoryURL is mandatory
BOOKS LABELS with attributes:ID is mandatory and refers to BOOKS.IDLABEL is mandatoryID and LABEL form the primary key
BOOK CONTRIBUTORSID is mandatory and refers to BOOKS.IDCONTRIBUTOR is mandatoryID and CONTRIBUTOR form the primary key
https://yaocm.wordpress.com/2024/03/14/using-xml-in-oracle-db-for-google-books/
#ExternalTable #XEDatabase #XML #XMLTABLE