Strip out HTML from element

htdub
Posts: 4
Joined: Wed Feb 04, 2009 2:30 am

Post by htdub »

Hi

I've got a large XML file that I would like to convert to a CSV file.

One of the problems I'm running into is that one of the XML elements contains a copy of an HTML email.

How do I grab just the text? Or strip away the HTML markup that comes before, say, the body tag?

<?xml version="1.0" encoding="utf-8" ?><RegistrantExport>
<Registrant><ProjectID>33</ProjectID><RegistrantNo>6</RegistrantNo><RatingID>132</RatingID><SourceTypeID>260</SourceTypeID><SecondarySourceTypeID></SecondarySourceTypeID><Status>Normal</Status><ExcludeFromTraffic>No</ExcludeFromTraffic><RegistrationDate>2005-09-04</RegistrationDate><EnteredBy></EnteredBy><LastContactDate></LastContactDate><LastContactType></LastContactType><PersonalID>1238599</PersonalID><City>Savona</City><Province>BC</Province><PostalCode>V0K 2J0</PostalCode><Country>Canada</Country><IsPrimary>1</IsPrimary></Address><IsPrimary>1</IsPrimary></Email></Emails><Questions><Question><Title>I heard about Tobiano from</Title><Answers><Answer>Other</Answer></Answers></Question></Questions><History><HistoryEntry><PersonalID>1238599</PersonalID><Project>Tobiano</Project><HistoryType>Mass Mail</HistoryType><SalesRep>Andrew Karpiak</SalesRep><Date>2008-12-08 12:51:06</Date><Subject><![CDATA[Tobiano - wins again!]]></Subject><Body><![CDATA[<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Tobiano | Live, rest, and play</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<style type="text/css"><!--
body {
margin-left: 0px;
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
background-color: #F6F5F0;
border-top-style: none;
border-right-style: none;
border-bottom-style: none;
border-left-style: none;
}

Thanks
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Re: Strip out HTML from element

Post by george »

The HTML content is in fact just text inside your Body element. If you use XSLT 2.0 you can process it with regular expressions; otherwise you are left with just substring, substring-before and substring-after.
Another possibility is to create and use an extension function that does the processing you want.

For the first proposal there is already an XSLT 2.0 stylesheet that does the processing for you, see
http://www.dcarlisle.demon.co.uk/htmlparse.xsl
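
If doing the conversion outside XSLT is an option, the same regular-expression idea can be sketched with Python's standard library. This is only a rough illustration, not a replacement for a real HTML parser: the SAMPLE document is a hypothetical, well-formed cut-down of the export in the question, and the names html_to_text and export_csv are made up for this example. Regex-based tag stripping is approximate and may mishandle unusual markup.

```python
import csv
import html
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample shaped like the export in the question.
SAMPLE = """<RegistrantExport>
  <Registrant>
    <RegistrantNo>6</RegistrantNo>
    <History>
      <HistoryEntry>
        <Subject><![CDATA[Tobiano - wins again!]]></Subject>
        <Body><![CDATA[<html><head><title>Tobiano</title>
<style type="text/css"><!-- body { margin-left: 0px; } --></style></head>
<body><p>Hello &amp; welcome</p><p>Second line</p></body></html>]]></Body>
      </HistoryEntry>
    </History>
  </Registrant>
</RegistrantExport>
"""


def html_to_text(markup: str) -> str:
    """Approximate plain-text version of an HTML string (regex based)."""
    # Keep only what follows the opening <body ...> tag, when there is one.
    body = re.search(r"<body\b[^>]*>(.*)", markup, flags=re.I | re.S)
    if body:
        markup = body.group(1)
    # Drop <style>/<script> blocks together with their contents.
    markup = re.sub(r"<(style|script)\b.*?</\1\s*>", " ", markup,
                    flags=re.I | re.S)
    # Drop comments, then every remaining tag.
    markup = re.sub(r"<!--.*?-->", " ", markup, flags=re.S)
    markup = re.sub(r"<[^>]+>", " ", markup)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(markup).split())


def export_csv(xml_text: str) -> str:
    """Write one CSV row per HistoryEntry, with the Body reduced to text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["RegistrantNo", "Subject", "Body"])
    for registrant in root.iter("Registrant"):
        number = registrant.findtext("RegistrantNo", default="")
        for entry in registrant.iter("HistoryEntry"):
            writer.writerow([
                number,
                entry.findtext("Subject", default=""),
                html_to_text(entry.findtext("Body", default="")),
            ])
    return out.getvalue()


if __name__ == "__main__":
    print(export_csv(SAMPLE), end="")
```

The CDATA section reaches the code as an ordinary string, which is why plain string and regex functions are enough here. The columns chosen for the CSV are just an example; the real export has many more fields.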

Regards,
George
George Cristian Bina