[SOLVED] Help with xml, xpath, namespaces.
cindyharper
Below is sample XML from GoogleCSE API:
<?xml version="1.0" encoding="UTF-8"?>
<feed gd:kind="customsearch#search" xmlns="http://www.w3.org/2005/Atom" xmlns:cse="http://schemas.google.com/cseapi/2010" xmlns:gd="http://schemas.google.com/g/2005" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<title>Google Custom Search - Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu</title>
<id>tag:www.googleapis.com,2010-09-29:/customsearch/v1?q= Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu&cx=008033228147187897025:-ua_scxr1uc&num=7&start=1&safe=off</id>
<author>
<name>Library Website Search Engine - Google Custom Search</name>
</author>
<updated>1970-01-16T11:10:30.455Z</updated>
<opensearch:Url type="application/atom+xml" template="https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={cse:safe?}&cx={cse:cx?}&cref={cse:cref?}&sort={cse:sort?}&filter={cse:filter?}&gl={cse:gl?}&cr={cse:cr?}}&googlehost={cse:googleHost?}&c2coff={?cse:disableCnTwTranslation}&hq={cse:hq?}&hl={cse:hl?}&siteSearch={cse:siteSearch?}&siteSearchFilter={cse:siteSearchFilter?}&exactTerms={cse:exactTerms?}&excludeTerms={cse:excludeTerms?}&linkSite={cse:linkSite?}&orTerms={cse:orTerms?}&relatedSite={cse:relatedSite?}&dateRestrict={cse:dateRestrict?}&lowRange={cse:lowRange?}&highRange={cse:highRange?}&searchType={cse:searchType?}&fileType={cse:fileType?}&rights={cse:rights?}&imgsz={cse:imgsz?}&imgtype={cse:imgtype?}&imgc={cse:imgc?}&imgcolor={cse:imgcolor?}&alt=atom"/>
<opensearch:Query role="request" title="Google Custom Search - Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu" totalResults="7" searchTerms=" Albertus Magnus College. library Albertus Magnus College Library intitle:newsletter albertus.edu" count="7" startIndex="1" inputEncoding="utf8" outputEncoding="utf8" cse:safe="off" cse:cx="008033228147187897025:-ua_scxr1uc"/>
<opensearch:totalResults>7</opensearch:totalResults>
<opensearch:startIndex>1</opensearch:startIndex>
<cse:context title="Library Website Search Engine"/>
<cse:searchInformation>
<cse:searchTime>0.073074</cse:searchTime>
<cse:formattedSearchTime>0.07</cse:formattedSearchTime>
<cse:totalResults>7</cse:totalResults>
<cse:formattedTotalResults>7</cse:formattedTotalResults>
</cse:searchInformation>
<cse:spelling>
<cse:correctedQuery type="html"/>
</cse:spelling>
<entry gd:kind="customsearch#result">
<id>http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf</id>
<updated>1970-01-16T11:10:30.455Z</updated>
<title type="html">Special Edition Athletics @lbertus <b>Newsletter</b></title>
<link href="http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf" title="www.albertus.edu"/>
<summary type="html">This weekend marks a busy and historic time on campus for the <b>Albertus</b>. <br> <b>Magnus College</b> Athletics Department as both the men&#39;s and women&#39;s soccer <b>...</b></summary>
<cse:cacheId>AJGUZgC9CVMJ</cse:cacheId>
<cse:mime>application/pdf</cse:mime>
<cse:fileFormat>PDF/Adobe Acrobat</cse:fileFormat>
<cse:formattedUrl type="html">www.<b>albertus.edu</b>/.../<b>albertus</b>-archive-october-2011-special-edition.pdf</cse:formattedUrl>
<cse:PageMap>
<cse:DataObject type="metatags">
<cse:Attribute name="creationdate" value="D:20111118135759-05'00'"/>
<cse:Attribute name="producer" value="Acrobat Web Capture 8.0"/>
<cse:Attribute name="moddate" value="D:20111118140743-05'00'"/>
<cse:Attribute name="title" value="Special Edition Athletics @lbertus Newsletter"/>
</cse:DataObject>
</cse:PageMap>
</entry>
...
</feed>
I'm using the Generate Extract operator. I've specified the namespaces as:
<list key="namespaces">
<parameter key="x" value="http://www.kbcafe.com/rss/atom.xsd.xml"/>
<parameter key="xmlns:cse" value="http://schemas.google.com/cseapi/2010"/>
<parameter key="xmlns:gd" value="http://schemas.google.com/g/2005"/>
<parameter key="xmlns:opensearch" value="http://a9.com/-/spec/opensearch/1.1/"/>
<parameter key="xx" value="xml"/>
</list>
I've tried XPath expressions such as
//x:feed
//feed
and more specific ones, but I can't seem to match anything in this feed. I'm sure the problem is in my namespaces, but I don't know where to look for the answer.
The targets I want to extract are
//x:feed/x:entry/x:title
and //x:feed/x:entry/x:link/@href.
Any help would be appreciated.
Answers
How are you trying to extract the XPaths? Your current process setup, and maybe some sample data, would help me write a well-founded answer.
Best,
Marius
My latest attempt was to take the import statements out of both the GooglePage attribute (see the Replace operator) and the .xsd. So the xsd looks like this: I wasn't able to follow the import xsd links from the Google output in my browser, which is why I decided to try to dispense with them.
Please have a look at the attached process. The trick is to prefix entry with "atom:", like this: //atom:entry, and to bind the atom prefix in the namespaces parameter to the namespace URI exactly as it is written in the XML data, i.e. the default xmlns on the feed element, http://www.w3.org/2005/Atom.
Best,
Marius
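For anyone who wants to check the same idea outside RapidMiner, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not the attached process; the trimmed sample feed and the variable names are made up for illustration. It shows that elements under a default xmlns only match when the XPath uses a prefix bound to exactly that URI, and that the prefix name itself (atom here) is arbitrary.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the feed in the question (illustrative, not the full response).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:cse="http://schemas.google.com/cseapi/2010">
  <title>Google Custom Search</title>
  <entry>
    <title type="html">Special Edition Athletics @lbertus Newsletter</title>
    <link href="http://www.albertus.edu/policy-reports/advancement-publications/documents/albertus-archive-october-2011-special-edition.pdf" title="www.albertus.edu"/>
    <cse:mime>application/pdf</cse:mime>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)  # root is the <feed> element

# 1) No prefix: matches nothing, because <entry> lives in the Atom namespace.
unprefixed = root.findall(".//entry")

# 2) Prefix bound to the wrong URI (the one from the question): still nothing.
wrong_ns = {"x": "http://www.kbcafe.com/rss/atom.xsd.xml"}
wrong_uri = root.findall(".//x:entry", wrong_ns)

# 3) Prefix bound to the URI exactly as written in the feed's xmlns: matches.
ns = {"atom": "http://www.w3.org/2005/Atom"}
titles = [t.text for t in root.findall("./atom:entry/atom:title", ns)]
hrefs = [link.get("href") for link in root.findall("./atom:entry/atom:link", ns)]

print(len(unprefixed), len(wrong_uri))
print(titles)
print(hrefs)
```

The same reasoning explains why //x:feed in the question found nothing: the x prefix was bound to http://www.kbcafe.com/rss/atom.xsd.xml rather than to the URI declared in the feed.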