Mastodawn

zartom Apr 19, 2025

How to Split Comma-Separated Values with Multiple Items in Brackets?
Efficiently parsing comma-separated values (CSV) in DB2 LUW strings, even with delimiters within the data, is now possible! Learn how XMLTABLE handles complex string formats for accurate data extraction. #DB2 #XMLTABLE #StringParsing #CSV #DataExtraction #DB2LUW
https://tech-champion.com/programming/how-to-split-comma-separated-values-with-multiple-items-in-brackets/
...

zartom Mar 29, 2025

How to Split Comma-Separated Values with Multiple Items in Brackets?
Efficiently parsing comma-separated values (CSV) in DB2 LUW strings, even with delimiters within the data, is now possible! Learn how XMLTABLE handles complex string formats for accurate data extraction. #DB2 #XMLTABLE #StringParsing #CSV #DataExtraction #DB2LUW
https://tech-champion.com/programming/how-to-split-comma-separated-values-with-multiple-items-in-brackets/
...

Yet Another OCM Mar 13, 2024

I demonstrate how to upload and extract information from an XML document that was exported from a Google Books bookshelf. Some data qaulity issues were detected with this bookshelf export. The final data model is a minor change from the model provided by Google Books: individual contributors are split out into a separate table; and the ISBN is used to identify books (where available).

Analysis of Exported Bookshelf

Google Books allows a Google user to create a bookshelf of various books discovered through Google Books. The contents of a bookshelf can be exported as an XML document. There is no XML schema published for this document.

My analysis of the XML document is here. The simplified structure is:

library
- list_title
- blurb
- url
- books
  - book
    - id
    - url
    - title
    - contributor
    - identifier
      - type
      - value
    - labels
      - label

The logical model for this Google Books bookshelf is:

Load Bookshelf Into XE Database

I created an Ansible workflow, called setup_demo_env.yml, to create the demonstration schema in Oracle XE database.

The loading of the XML export file is achieved through the use of an external table that uses SQL*Loader.

Extract Repeating Groups

There are two (2) repeating groups in each book:

Labels which is an XML sequence of label tags.

Contributor which a comma-separated list of authors.

Extract Labels from Repeating Group

The labels are extracted using a XQUERY FOR loop. See Extract Labels for a detailed description.

The SQL used was:

SELECT    l.*FROM  database_bookshelf_ext e,  XMLTABLE (    'for $i in /books/book/labels/label      return <row>        {          $i/../../id,          $i        }      </row>'    PASSING xmltype(e.object_value)    COLUMNS      "ID" VARCHAR2(16) PATH 'id',      label VARCHAR2(16) PATH 'label'    )                      l;

This SQL takes each row from the external table (DATABASE_BOOKSHELF_EXT) and converts the CLOB in the OBJECT_VALUE into XML using the XMLTABLE function. For each occurrence on the LABEL tag that is found by the XPATH /books/book/labels/label, I get the contents of the LABEL tag (which is pointed to by the $i loop variable), then I go up two (2) levels (to XPATH /books/book) and get the contents of the ID tag. These two (2) values are combined into a row for the XMLTABLE function.

The first few rows returned are:

ID               LABEL---------------- --------------------------------45fR_OeonfMC     Oracle45fR_OeonfMC     DatabasesuhSAAAAMAAJ     DatabasesuhSAAAAMAAJ     SecurityOhNBJ8_wMLwC     OracleOhNBJ8_wMLwC     DatabaseOhNBJ8_wMLwC     Security2egNzTk871wC     Database2egNzTk871wC     Relational Theory2egNzTk871wC     Logic2ImPFP6Yk64C     Database

Extract Contributor from Repeating Group

The XML document has a single CONTRIBUTOR tag for each book. For books with multiple authors, the contents of this tag is a comma-separated list. Apparently, there is no XQUERY function to values from a comma-separated list that has no commas as in the case of a single author. (It seems that XQUERY rejects the Indian invention of zero – even arrays start at one (1).)

Individual authors are extracted using a pipelined function which is implemented as the GET_CONTRIBUTOR Package and GET_CONTRIBUTOR Package Body. (See Extract Contributors.)

The heart of the pipelined function is:

LOOP    FETCH p_csr INTO in_rec;  -- input row    EXIT WHEN p_csr%notfound;    out_rec.id := in_rec.id;    l_num_contributor := regexp_count(in_rec.contributor_list, '[^,]+');    FOR l_idx_contributor IN 1..l_num_contributor LOOP        out_rec.contributor := ltrim(regexp_substr(in_rec.contributor_list, '[^,]+', 1, l_idx_contributor));        PIPE ROW ( out_rec );    END LOOP;END LOOP;

For each CONTRIBUTOR tag encountered, I use the REGEXP_SUBSTR to extract the nth occurrence of the required regular expression (which is a string of at least one (1) character that is not a comma, i.e. '[^,]+'). The REGEXP_COUNT gives the number of such substrings, and is used to drive the FOR loop.

For each substring that I find, I create a row and pass that to the calling procedure or SQL statement using the PIPE ROW statement.

This is invoked as follows:

SELECT    *FROM  TABLE (    contributor_pkg.get_contributor(      CURSOR(        SELECT            c.*          FROM            database_bookshelf_ext e,            XMLTABLE(              '/books/book'              PASSING xmltype(e.object_value)              COLUMNS                "ID" VARCHAR2(32) PATH 'id',                contributors VARCHAR2(1024) PATH 'contributor'            ) c        )    )  );

The innermost SELECT statement extracts a pair of ID and CONTRIBUTOR tag contents from each book (using the XPATH /books/book). The CURSOR funtctions allows the CONTRIBUTOR_PKG.GET_CONTRIBUTOR function to access this result set. The TABLE function converts the output from this function back into a SQL table. There is a lot of pluming involved!

The first few results are:

ID            CONTRIBUTORG9AJA91PL54C  Jonathan Lewis45fR_OeonfMC  Thomas Kyte1Aqw6mrxz5AC  Kerry Osborne1Aqw6mrxz5AC  Randy Johnson1Aqw6mrxz5AC  Tanel PõderlA3QygAACAAJ  Eric Redmond

Final Relational Model

I created materialised views to snapshot the data uploaded from the XML document. Querying the external table causes the SQL*Loader to read the XML document. The materialised views created were:

BOOKS

BOOK_LABELS

BOOK_CONTRIBUTORS

With the completion of materialised views, I composed a new logical database model for the DB_BOOKS schema:

There are three (3) entities:

BOOKS with attributes: