Below is the raw data used throughout this chapter. The data includes entries for people, the cars they own, and various maintenance services performed on those cars:
<?xml version="1.0" encoding="UTF-8" ?><root> <rootAttr1>rootValue1</rootAttr1> <people> <personal> <age>20</age> <gender>M</gender> <name> <first>John</first> <last>Doe</last> </name> </personal> <jobs>support</jobs> <jobs>coding</jobs> <vehicles> <type>car</type> <model>Honda Civic</model> <insurance> <company>ABC Insurance</company> <policy_num>12345</policy_num> </insurance> <features>sunroof</features> <features>rims</features> <maintenance> <date>07-17-2017</date> <desc>oil change</desc> </maintenance> <maintenance> <date>01-03-2018</date> <desc>new tires</desc> </maintenance> </vehicles> <vehicles> <type>truck</type> <model>Dodge Ram</model> <insurance> <company>ABC Insurance</company> <policy_num>12345</policy_num> </insurance> <features>lift kit</features> <features>tow package</features> <maintenance> <date>08-27-2017</date> <desc>new tires</desc> </maintenance> <maintenance> <date>01-08-2018</date> <desc>oil change</desc> </maintenance> </vehicles> <addresses> <type>work</type> <zip>12345</zip> </addresses> <addresses> <type>home</type> <zip>12357</zip> </addresses> <source>internet</source> </people> <people> <personal> <age>24</age> <gender>F</gender> <name> <first>Jane</first> <last>Roberts</last> </name> </personal> <jobs>sales</jobs> <jobs>marketing</jobs> <source>phone</source> <vehicles> <type>car</type> <model>Toyota Camry</model> <insurance> <company>Car Insurance</company> <policy_num>98765</policy_num> </insurance> <features>upgraded stereo</features> <maintenance> <date>05-11-2017</date> <desc>tires rotated</desc> </maintenance> <maintenance> <date>11-03-2017</date> <desc>oil change</desc> </maintenance> </vehicles> <vehicles> <type>car</type> <model>Honda Accord</model> <insurance> <company>Car Insurance</company> <policy_num>98765</policy_num> </insurance> <features>custom paint</features> <features>custom wheels</features> <maintenance> <date>10-07-2017</date> <desc>new air filter</desc> </maintenance> <maintenance> <date>01-13-2018</date> <desc>new brakes</desc> </maintenance> </vehicles> <addresses> <type>home</type> <zip>98765</zip> </addresses> <addresses> <type>work</type> <zip>98753</zip> </addresses> </people> <rootAttr2>rootValue2</rootAttr2> <rootAttr3>rootValue3</rootAttr3> <rootAttr3>rootValue4</rootAttr3></root>
After connecting to your data source, set DataModel to more closely match the data representation to the structure of your data.
Below are example connection strings to XML files or streams, using the provider's default data modeling configuration (see below):
The DataModel property is the controlling property over how your data is represented into tables and toggles the following basic configurations.
Document (default): Model a top-level, document view of your XML data. The provider returns nested elements as aggregated XML.
FlattenedDocuments: Detect nested documents and implicitly join them into a single table.
Relational: Return individual, related tables from hierarchical data. The tables contain a primary key and a foreign key that links to the parent document.
Service provider
URI formats
Connection example
Local
localPath
file:///localPath/file.xml
URI=C:\folder1\file.xml;
HTTP or HTTPS
http://remoteStream
https://remoteStream
URI=http://www.host1.com/streamname1;
Amazon S3
s3://bucket1/folder1/file.xml
URI=s3://bucket1/folder1/file.xml; AWSAccessKey=token1; AWSSecretKey=secret1; AWSRegion=OHIO;
Azure Blob Storage
azureblob://mycontainer/myblob/
URI=azureblob://mycontainer/myblob; AzureAccount=myAccount; AzureAccessKey=myKey;
Google Drive
gdrive://remotePath/file.xml
URI=gdrive://folder1/file.xml;InitiateOAuth=GETANDREFRESH;
Box
box://remotePath/file.xml
URI=box://folder1/file.xml; InitiateOAuth=GETANDREFRESH;
Dropbox
dropbox://remotePath/file.xml
URI=dropbox://folder1/file.xml; InitiateOAuth=GETANDREFRESH; OAuthClientId=oauthclientid1; OAuthClientSecret=oauthcliensecret1; CallbackUrl=http://localhost:12345;
Sharepoint
sp://remotePath/file.xml
URI=sp://Documents/folder1/file.xml; User=user1; Password=password1; SharepointUrl=https://subdomain.sharepoint.com;
FTP or FTPS
ftp://server:port/remotePath/file.xml
ftps://server:port/remotepath/file.xml
URI=ftps://localhost:990/folder1/file.xml; User=user1; Password=password1;
AzureDataLakeStoreGen2
abfs://myfilesystem/remotePath/file.xml
abfss://myfilesystem@accountName.dfs.core.windows.net/remotepath/file.xml
URI=abfs://myfilesystem/folder1/file.xml; AzureAccount=myAccount; AzureAccessKey=myKey;
URI=abfss://myfilesystem@myAccount.dfs.core.windows.net/folder1/file.xml; AzureAccessKey=myKey;
AzureDataLakeStoreGen2 with SSL
abfss://myfilesystem/remotePath/file.xml
abfs://myfilesystem@accountName.dfs.core.windows.net/remotepath/file.xml
URI=abfss://myfilesystem/folder1/file.xml; AzureAccount=myAccount; AzureAccessKey=myKey;
URI=abfss://myfilesystem@myAccount.dfs.core.windows.net/folder1/file.xml; AzureAccessKey=myKey;
Wasabi
wasabi://bucket/remotePath/file.xml
URI=wasabi://bucket/folder1/file.xml; AWSAccessKey=token1; AWSSecretKey=secret1; AWSRegion=NorthenVirginia;
To connect with the provider's data modeling defaults, set the URI of the XML resource and authenticate. This section provides connection and authentication guides for the available XML data sources. See Connecting to XML Data Sources to control the data representation.
Here is an overview of the authentication properties corresponding to each data source:
Set the URI to an XML file. Below is an example connection string:
Set the URI to the HTTP or HTTPS URL of the XML resource you want to access as a table. Set AuthScheme to use the following authentication types. The provider also supports OAuth authentication; see Using OAuth for more information.
HTTP:To use HTTP Basic or Digest, set the User and Password and set the corresponding AuthScheme. Set CustomHeaders if you need access to the request headers. Set CustomUrlParams to modify the URL query string.
Windows (NTLM): Set the Windows User and Password to connect and set AuthScheme to "NTLM".
Kerberos and Kerberos Delegation: To authenticate with Kerberos, set the User and Password and set AuthScheme to NEGOTIATE. To use Kerberos Delegation, set AuthScheme to KERBEROSDELEGATION.
For example:
Set the URI to an XML document in a bucket. Additionally, set the following properties to authenticate:
AWSAccessKey: Set to an Amazon Web Services Access Key (a user name).
AWSSecretKey: Set to an Amazon Web Services Secret Key.
For example:
Optionally, specify AWSRegion.
Set the URI to the bucket and folder. Additionally, set the following properties to authenticate:
AWSAccessKey: Set this to an Wasabi Access Key (a username)
AWSSecretKey: Set this to an Wasabi Secret Key.
For example:
Optionally, specify AWSRegion in addition.
Set the URI to the name of your container and the name of the blob. Additionally, set the following properties to authenticate:
AzureAccount: Set this to the account associated with the Azure blob.
AzureAccessKey: Set the to the access key associated with the Azure blob.
For example:
Set the URI to the path to an XML file and authenticate using the OAuth standard. See Connecting to Google Drive for an authentication guide. You can connect with a user account or a service account. In the user account flow, you do not need to set any connection properties for your user credentials, as shown in the following connection string:
Set the URI to the path to an XML file and authenticate using the OAuth standard. See Connecting to Box for an authentication guide. You can authenticate with a user account or a service account. In the user account flow, you do not need to set any connection properties for your user credentials, as shown in the following connection string:
Set the URI to the path to an XML file and authenticate using the OAuth standard. You can authenticate with a user account or a service account. In the user account flow, you do not need to set any connection properties for your user credentials, as shown in the following connection string:
Set the URI to a document library containing XML files. To authenticate, set User and Password and SharepointUrl.
Set the URI to the address of the server followed by the path to an XML file. To authenticate, set User and Password.
By default, the provider attempts to negotiate SSL/TLS by checking the server's certificate against the system's trusted certificate store. To specify another certificate, see the SSLServerCert property for the available formats to do so.
Service provider
URI formats
InitiateOAuth
OAuthClientId
OAuthClientSecret
OAuthAccessToken
OAuthAccessTokenSecret
User
Password
AuthScheme
AzureAccount
AzureAccessKey
AWSAccessKey
AWSSecretKey
AWSRegion
Local
localPath
file://localPath
HTTP or HTTPS
http://remoteStream
https://remoteStream
OPTIONAL
OPTIONAL
OPTIONAL
Amazon S3
s3://remotePath
REQUIRED (your AccessKey)
REQUIRED (your SecretKey)
OPTIONAL
Azure Blob Storage
azureblob://mycontainer/myblob/
REQUIRED
REQUIRED (your AccessKey)
Google Drive
gdrive://remotePath
OPTIONAL
OPTIONAL
OPTIONAL
OPTIONAL
OPTIONAL
Box
box://remotePath
OPTIONAL
REQUIRED
REQUIRED
OPTIONAL
OPTIONAL
Dropbox
dropbox://remotePath
OPTIONAL
REQUIRED
REQUIRED
OPTIONAL
OPTIONAL
SharePoint Online
sp://remotePath
REQUIRED
REQUIRED
FTP or FTPS
ftp://server:port/remotePath
ftps://server:port/remotepath
REQUIRED
REQUIRED
Wasabi
wasabi://bucket1/remotePath;
REQUIRED (your AccessKey)
REQUIRED (your SecretKey)
OPTIONAL
URI=C:\folder1\file.xml;
URI=http://www.host1.com/streamname1;AuthScheme=BASIC;User=admin;Password=admin
URI=s3://bucket1/folder1/file.xml; AWSAccessKey=token1; AWSSecretKey=secret1; AWSRegion=OHIO;
URI=wasabi://bucket1/folder1/file.xml; AWSAccessKey=token1; AWSSecretKey=secret1; AWSRegion=OHIO;
URI=azureblob://mycontainer/myblob; AzureAccount=myAccount; AzureAccessKey=myKey;
URI=gdrive://folder1/file.xml;InitiateOAuth=GETANDREFRESH;
URI=box://folder1/file.xml; InitiateOAuth=GETANDREFRESH;
URI=dropbox://folder1/file.xml; InitiateOAuth=GETANDREFRESH; OAuthClientId=oauthclientid1; OAuthClientSecret=oauthcliensecret1; CallbackUrl=http://localhost:12345;
URI=sp://Documents/folder1/file.xml; User=user1; Password=password1; SharepointUrl=https://subdomain.sharepoint.com;
URI=ftps://localhost:990/folder1/file.xml; User=user1; Password=password1;
The provider offers three basic configurations to model nested data in XML as tables.
For users who simply need access to the entirety of their XML data, flattening the data into a single table is the best option. The provider will use streaming and only parses the XML data once per query in this mode.
With DataModel set to "FlattenedDocuments", the provider returns a separate table for each object array, but implicitly JOINed to the parent table. Any nested sibling XPath values (child paths at the same height) will be treated as a SQL CROSS JOIN.
Below is a sample query and the results, based on the sample document in Raw Data and parsing based on the XPaths /root/people, /root/people/vehicles, and /root/people/vehicles/maintenance. This implicitly JOINs the people element with the vehicles element and implicitly JOINs the vehicles element with the maintenance element.
Use the following connection string to query the Raw Data in this example.
The following query drills into the nested elements in each people element. Since the XPath property included the vehicles node as an XML path, you can query an element of a vehicle explicitly.
With horizontal and vertical flattening based on the described paths, each vehicle element is implicitly JOINed to its parent people element and each maintenance element is implicitly JOINed to its parent vehicle element.
The provider for XML can be configured to create a relational model of the data in the XML file or source, treating each XPath as an individual table containing a primary key and a foreign key that links to the parent document. This is particularly useful if you need to work with your XML data in existing BI, reporting, and ETL tools that expect a relational data model.
With DataModel set to "Relational", any JOINs are controlled by the query. Any time you perform a JOIN query, the XML file or source will be queried once for each table included in the query.
Below is a sample query against the sample document in Raw Data, using a relational model based on the XML paths "/root/people", "/root/people/vehicles", and "/root/people/vehicles/maintenance".
Set the DataModel connection property to "Relational" and set the XPath connection property to "/root/people;/root/people/vehicles;/root/people/vehicles/maintenance;" to perform the following query and see the example result set.
The following query explicitly JOINs the people, vehicles, and maintenance tables.
In the example query, each maintenance element is JOINed to its parent vehicle element, which is JOINed to its parent people element to produce a table with 8 rows (2 maintenance entries for each of 2 vehicles each for 2 people).
Using a top-level document view of the data provides ready access to top-level elements. The provider returns nested elements in aggregate, as single columns.
One aspect to consider is performance. You forego the time and resources to process and parse nested elements -- the provider parses the returned data once, using streaming to read the data. Another consideration is your need to access any data stored in nested parent elements, and the ability of your tool or application to process XML.
With DataModel set to "Document" (the default), the provider scans only a single element, the top-level element by default. The top-level elements are available as columns due to default object flattening. Nested elements are returned as aggregated XML.
You can set XPath to specify an element other than the top-level one.
Below is a sample query and the results, based on the sample document in Raw Data. The query results in a single "people" table based on the XPath "/root/people".
Set the DataModel connection property to "Document" to perform the following query and see the example result set. The provider will scan only the XPath value below:
The following query pulls the top-level elements and the subelements of the vehicles element into the results.
With a document view of the data, the personal element is flattened into 4 columns and the source and vehicles elements are returned as individual columns, resulting in a table with 6 columns.
URI=C:\people.txt;DataModel=FlattenedDocuments;XPath='/root/people;/root/people/vehicles;/root/people/vehicles/maintenance;'
SELECT [personal.age]
AS
age, [personal.gender]
AS
gender, [personal.name.first]
AS
name_first, [personal.name.last]
AS
name_last, [source], [type], [model], [insurance.company]
AS
ins_company, [insurance.policy_num]
AS
ins_policy_num, [date]
AS
maint_date, [desc]
AS
maint_descFROM [people]
age
gender
first_name
last_name
source
type
model
ins_company
ins_policy_num
maint_date
maint_desc
20
M
John
Doe
internet
car
Honda Civic
ABC Insurance
12345
2017-07-17
oil change
20
M
John
Doe
internet
car
Honda Civic
ABC Insurance
12345
2018-01-03
new tires
20
M
John
Doe
internet
truck
Dodge Ram
ABC Insurance
12345
2017-08-27
new tires
20
M
John
Doe
internet
truck
Dodge Ram
ABC Insurance
12345
2018-01-08
oil change
24
F
Jane
Roberts
phone
car
Toyota Camry
Car Insurance
98765
2017-05-11
tires rotated
24
F
Jane
Roberts
phone
car
Toyota Camry
Car Insurance
98765
2017-11-03
oil change
24
F
Jane
Roberts
phone
car
Honda Accord
Car Insurance
98765
2017-10-07
new air filter
24
F
Jane
Roberts
phone
car
Honda Accord
Car Insurance
98765
2018-01-13
new brakes
URI=C:\people.txt;DataModel=Relational;XPath='/root/people;/root/people/vehicles;/root/people/vehicles/maintenance;'
SELECT [people].[personal.age]
AS
age, [people].[personal.gender]
AS
gender, [people].[personal.name.first]
AS
first_name, [people].[personal.name.last]
AS
last_name, [people].[source], [vehicles].[type], [vehicles].[model], [vehicles].[insurance.company]
AS
ins_company, [vehicles].[insurance.policy_num]
AS
ins_policy_num, [maintenance].[date]
AS
maint_date, [maintenance].[desc]
AS
maint_descFROM [people]JOIN [vehicles]ON [people].[_id] = [vehicles].[people_id]JOIN [maintenance]ON [vehicles].[_id] = [maintenance].[vehicles_id]
age
gender
first_name
last_name
source
type
model
ins_company
ins_policy_num
maint_date
maint_desc
20
M
John
Doe
internet
car
Honda Civic
ABC Insurance
12345
2017-07-17
oil change
20
M
John
Doe
internet
car
Honda Civic
ABC Insurance
12345
2018-01-03
new tires
20
M
John
Doe
internet
truck
Dodge Ram
ABC Insurance
12345
2017-08-27
new tires
20
M
John
Doe
internet
truck
Dodge Ram
ABC Insurance
12345
2018-01-08
oil change
24
F
Jane
Roberts
phone
car
Toyota Camry
Car Insurance
98765
2017-05-11
tires rotated
24
F
Jane
Roberts
phone
car
Toyota Camry
Car Insurance
98765
2017-11-03
oil change
24
F
Jane
Roberts
phone
car
Honda Accord
Car Insurance
98765
2017-10-07
new air filter
24
F
Jane
Roberts
phone
car
Honda Accord
Car Insurance
98765
2018-01-13
new brakes
URI=C:\people.txt;DataModel=Document;XPath='/root/people';
SELECT [personal.age]
AS
age, [personal.gender]
AS
gender, [personal.name.first]
AS
name_first, [personal.name.last]
AS
name_last, [source], [vehicles]FROM [people]
age
gender
name_first
name_last
source
vehicles
20
M
John
Doe
internet
24
F
Jane
Roberts
phone