How to Split Comma-Separated Values with Multiple Items in Brackets?
Efficiently parsing comma-separated values (CSV) in DB2 LUW strings, even with delimiters within the data, is now possible! Learn how XMLTABLE handles complex string formats for accurate data extraction. #DB2 #XMLTABLE #StringParsing #CSV #DataExtraction #DB2LUW
https://tech-champion.com/programming/how-to-split-comma-separated-values-with-multiple-items-in-brackets/
...
How to Split Comma-Separated Values with Multiple Items in Brackets?
Efficiently parsing comma-separated values (CSV) in DB2 LUW strings, even with delimiters within the data, is now possible! Learn how XMLTABLE handles complex string formats for accurate data extraction. #DB2 #XMLTABLE #StringParsing #CSV #DataExtraction #DB2LUW
https://tech-champion.com/programming/how-to-split-comma-separated-values-with-multiple-items-in-brackets/
...

I demonstrate how to upload and extract information from an XML document that was exported from a Google Books bookshelf. Some data qaulity issues were detected with this bookshelf export. The final data model is a minor change from the model provided by Google Books: individual contributors are split out into a separate table; and the ISBN is used to identify books (where available).

Analysis of Exported Bookshelf

Google Books allows a Google user to create a bookshelf of various books discovered through Google Books. The contents of a bookshelf can be exported as an XML document. There is no XML schema published for this document.

My analysis of the XML document is here. The simplified structure is:

  • library
    • list_title
    • blurb
    • url
    • books
      • book
        • id
        • url
        • title
        • contributor
        • identifier
          • type
          • value
        • labels
          • label

The logical model for this Google Books bookshelf is:

Load Bookshelf Into XE Database

I created an Ansible workflow, called setup_demo_env.yml, to create the demonstration schema in Oracle XE database.

The loading of the XML export file is achieved through the use of an external table that uses SQL*Loader.

Extract Repeating Groups

There are two (2) repeating groups in each book:

  • Labels which is an XML sequence of label tags.
  • Contributor which a comma-separated list of authors.
  • Extract Labels from Repeating Group

    The labels are extracted using a XQUERY FOR loop. See Extract Labels for a detailed description.

    The SQL used was:

    SELECT l.*FROM database_bookshelf_ext e, XMLTABLE ( 'for $i in /books/book/labels/label return <row> { $i/../../id, $i } </row>' PASSING xmltype(e.object_value) COLUMNS "ID" VARCHAR2(16) PATH 'id', label VARCHAR2(16) PATH 'label' ) l;

    This SQL takes each row from the external table (DATABASE_BOOKSHELF_EXT) and converts the CLOB in the OBJECT_VALUE into XML using the XMLTABLE function. For each occurrence on the LABEL tag that is found by the XPATH /books/book/labels/label, I get the contents of the LABEL tag (which is pointed to by the $i loop variable), then I go up two (2) levels (to XPATH /books/book) and get the contents of the ID tag. These two (2) values are combined into a row for the XMLTABLE function.

    The first few rows returned are:

    ID LABEL---------------- --------------------------------45fR_OeonfMC Oracle45fR_OeonfMC DatabasesuhSAAAAMAAJ DatabasesuhSAAAAMAAJ SecurityOhNBJ8_wMLwC OracleOhNBJ8_wMLwC DatabaseOhNBJ8_wMLwC Security2egNzTk871wC Database2egNzTk871wC Relational Theory2egNzTk871wC Logic2ImPFP6Yk64C Database

    Extract Contributor from Repeating Group

    The XML document has a single CONTRIBUTOR tag for each book. For books with multiple authors, the contents of this tag is a comma-separated list. Apparently, there is no XQUERY function to values from a comma-separated list that has no commas as in the case of a single author. (It seems that XQUERY rejects the Indian invention of zero – even arrays start at one (1).)

    Individual authors are extracted using a pipelined function which is implemented as the GET_CONTRIBUTOR Package and GET_CONTRIBUTOR Package Body. (See Extract Contributors.)

    The heart of the pipelined function is:

    LOOP FETCH p_csr INTO in_rec; -- input row EXIT WHEN p_csr%notfound; out_rec.id := in_rec.id; l_num_contributor := regexp_count(in_rec.contributor_list, '[^,]+'); FOR l_idx_contributor IN 1..l_num_contributor LOOP out_rec.contributor := ltrim(regexp_substr(in_rec.contributor_list, '[^,]+', 1, l_idx_contributor)); PIPE ROW ( out_rec ); END LOOP;END LOOP;

    For each CONTRIBUTOR tag encountered, I use the REGEXP_SUBSTR to extract the nth occurrence of the required regular expression (which is a string of at least one (1) character that is not a comma, i.e. '[^,]+'). The REGEXP_COUNT gives the number of such substrings, and is used to drive the FOR loop.

    For each substring that I find, I create a row and pass that to the calling procedure or SQL statement using the PIPE ROW statement.

    This is invoked as follows:

    SELECT *FROM TABLE ( contributor_pkg.get_contributor( CURSOR( SELECT c.* FROM database_bookshelf_ext e, XMLTABLE( '/books/book' PASSING xmltype(e.object_value) COLUMNS "ID" VARCHAR2(32) PATH 'id', contributors VARCHAR2(1024) PATH 'contributor' ) c ) ) );

    The innermost SELECT statement extracts a pair of ID and CONTRIBUTOR tag contents from each book (using the XPATH /books/book). The CURSOR funtctions allows the CONTRIBUTOR_PKG.GET_CONTRIBUTOR function to access this result set. The TABLE function converts the output from this function back into a SQL table. There is a lot of pluming involved!

    The first few results are:

    ID CONTRIBUTORG9AJA91PL54C Jonathan Lewis45fR_OeonfMC Thomas Kyte1Aqw6mrxz5AC Kerry Osborne1Aqw6mrxz5AC Randy Johnson1Aqw6mrxz5AC Tanel PõderlA3QygAACAAJ Eric Redmond

    Final Relational Model

    I created materialised views to snapshot the data uploaded from the XML document. Querying the external table causes the SQL*Loader to read the XML document. The materialised views created were:

  • BOOKS
  • BOOK_LABELS
  • BOOK_CONTRIBUTORS
  • With the completion of materialised views, I composed a new logical database model for the DB_BOOKS schema:

    There are three (3) entities:

  • BOOKS with attributes:
    • ID is primary key
    • TITLE is mandatory
    • URL is mandatory
  • BOOKS LABELS with attributes:
    • ID is mandatory and refers to BOOKS.ID
    • LABEL is mandatory
    • ID and LABEL form the primary key
  • BOOK CONTRIBUTORS
    • ID is mandatory and refers to BOOKS.ID
    • CONTRIBUTOR is mandatory
    • ID and CONTRIBUTOR form the primary key
  • https://yaocm.wordpress.com/2024/03/14/using-xml-in-oracle-db-for-google-books/

    #ExternalTable #XEDatabase #XML #XMLTABLE