Strip out HTML from element

Post by **htdub** » Tue Feb 10, 2009 2:02 am

Hi

I've got a large XML file that I would like to convert to a CSV file.

One of the problems I'm running into is that one of the XML elements contains a copy of an HTML email.

How do I grab just the text? Or strip away the HTML markup before, say, the body tag?


<?xml version="1.0" encoding="utf-8" ?><RegistrantExport>

<Registrant><ProjectID>33</ProjectID><RegistrantNo>6</RegistrantNo><RatingID>132</RatingID><SourceTypeID>260</SourceTypeID><SecondarySourceTypeID></SecondarySourceTypeID><Status>Normal</Status><ExcludeFromTraffic>No</ExcludeFromTraffic><RegistrationDate>2005-09-04</RegistrationDate><EnteredBy></EnteredBy><LastContactDate></LastContactDate><LastContactType></LastContactType><PersonalID>1238599</PersonalID><City>Savona</City><Province>BC</Province><PostalCode>V0K 2J0</PostalCode><Country>Canada</Country><IsPrimary>1</IsPrimary></Address><IsPrimary>1</IsPrimary></Email></Emails><Questions><Question><Title>I heard about Tobiano from</Title><Answers><Answer>Other</Answer></Answers></Question></Questions><History><HistoryEntry><PersonalID>1238599</PersonalID><Project>Tobiano</Project><HistoryType>Mass Mail</HistoryType><SalesRep>Andrew Karpiak</SalesRep><Date>2008-12-08 12:51:06</Date><Subject><![CDATA[Tobiano - wins again!]]></Subject><Body><![CDATA[<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

<head>

<title>Tobiano | Live, rest, and play</title>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

<style type="text/css"><!--

body {

margin-left: 0px;

margin-top: 0px;

margin-right: 0px;

margin-bottom: 0px;

background-color: #F6F5F0;

border-top-style: none;

border-right-style: none;

border-bottom-style: none;

border-left-style: none;

}

Thanks

Post by **george** » Tue Feb 10, 2009 6:51 pm

The HTML content is in fact just text inside your Body element. You can use regular expressions if you use XSLT 2.0; otherwise you are left with just substring, substring-before and substring-after to process it.
Another possibility is to create and use an extension function that can do the processing you want.

For the first proposal there is already an XSLT 2.0 stylesheet that does the processing for you, see
http://www.dcarlisle.demon.co.uk/htmlparse.xsl

Regards,
George
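
For readers who are not tied to XSLT, the regular-expression idea George describes can be sketched in Python. This is only a rough illustration, not the stylesheet linked above: the sample document is a hypothetical, well-formed miniature of the export (element names taken from the post), and `strip_html` and `export_csv` are made-up helper names. Regular expressions are fragile on real-world HTML, so treat this as a starting point.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET
from html import unescape

# Hypothetical, well-formed miniature of the export; element names follow the post.
SAMPLE = """<RegistrantExport>
  <Registrant>
    <RegistrantNo>6</RegistrantNo>
    <History>
      <HistoryEntry>
        <Subject><![CDATA[Tobiano - wins again!]]></Subject>
        <Body><![CDATA[<html><head><title>Tobiano</title>
<style type="text/css"><!-- body { margin-left: 0px; } --></style></head>
<body><p>Hello &amp; welcome</p></body></html>]]></Body>
      </HistoryEntry>
    </History>
  </Registrant>
</RegistrantExport>"""


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML string, using regular expressions only."""
    # Drop everything up to and including the <body> tag (head, title, style).
    # If there is no <body> tag, the whole string is kept.
    body = re.split(r"<body\b[^>]*>", markup, maxsplit=1, flags=re.IGNORECASE)[-1]
    # Remove script/style blocks and comments; their contents are not visible text.
    body = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", body, flags=re.I | re.S)
    body = re.sub(r"<!--.*?-->", " ", body, flags=re.S)
    # Remove the remaining tags, decode entities, and collapse whitespace.
    body = re.sub(r"<[^>]+>", " ", body)
    return " ".join(unescape(body).split())


def export_csv(xml_text: str) -> str:
    """Write one CSV row per HistoryEntry, with the Body reduced to plain text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["RegistrantNo", "Subject", "Body"])
    for registrant in root.iter("Registrant"):
        number = registrant.findtext("RegistrantNo", default="")
        for entry in registrant.iter("HistoryEntry"):
            writer.writerow([
                number,
                entry.findtext("Subject", default=""),
                strip_html(entry.findtext("Body", default="")),
            ])
    return out.getvalue()


if __name__ == "__main__":
    print(export_csv(SAMPLE))
```

Because the email sits in a CDATA section, the XML parser hands it over as one plain string, which is why the tags have to be removed with string processing rather than by navigating child elements.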
