Import TEI header information from excel and combine it with

Posted: Tue Jan 27, 2015 9:31 am
by pilar
Hello, I am new to Oxygen and it would be great if you could help me with a corpus I am working on.

I have several hundred plain text files. They are short texts (around 200 words each) which together make up a corpus of around 700,000 words.

I would like to create an XML version for every file (using a "TEI lite" schema) containing information like the title of the text, the author, date, and so on.

I already have an Excel file with the information associated with every text file (one row per text file; the columns contain information like the title of the text, author, date, etc.), like this:

filename    title    author    date
filename1   title1   author1   2013
filename2   title2   author2   2010
filename3   title3   author3   2011
...

However, the Excel file only contains the metadata associated with every text file (and the name of the text file), not the text itself.

I would like to convert the information contained in the excel file into a TEI header, and combine it with the text file, so that I have an XML document with two parts: TEI header and text.

Could you please let me know how I can do that?

I think that I can export the Excel information into XML, as explained here: https://www.udemy.com/blog/excel-to-xml/ but I am not sure how to include the text itself in the XML file.

Any help would be very much appreciated.

Pilar

Re: Import TEI header information from excel and combine it

Posted: Tue Jan 27, 2015 12:40 pm
by adrian
Hello,

1. First you must either:
- From Oxygen, import the Excel file into XML (File > Import > MS Excel File) so you can process it with a stylesheet.
or
- From Excel, export it to XML (as explained in the link you found).

2. Then you must write a stylesheet that processes that XML file, looks up each text filename, and pulls in the content of each text file as is (with unparsed-text()).
After you have decided on the structure of the imported XML file, please post a snippet of that XML, since the stylesheet must be tailored to that XML structure.
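
To illustrate step 2, here is a minimal XSLT 2.0 sketch of the lookup. The element names rows, row and filename are hypothetical placeholders (the real names depend on how the Excel file was imported), and the sketch assumes the text files are UTF-8, have names that are valid relative URIs, and sit in the same folder as the imported XML file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: <rows><row><filename>a.txt</filename>...</row>...</rows> -->
  <xsl:template match="/">
    <texts>
      <xsl:for-each select="//row">
        <!-- Resolve the file name against the location of the imported XML file -->
        <xsl:variable name="uri" select="resolve-uri(filename, base-uri(.))"/>
        <text source="{filename}">
          <!-- Pull the file in as plain text; leave the element empty if it cannot be read -->
          <xsl:value-of select="if (unparsed-text-available($uri, 'UTF-8'))
                                then unparsed-text($uri, 'UTF-8')
                                else ''"/>
        </text>
      </xsl:for-each>
    </texts>
  </xsl:template>
</xsl:stylesheet>
```

Resolving the name explicitly with resolve-uri() is a design choice here: without it, a relative name passed to unparsed-text() is resolved against the stylesheet's location rather than the XML file's.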

Regards,
Adrian

Re: Import TEI header information from excel and combine it

Posted: Thu Jan 29, 2015 10:44 am
by pilar
Hello again,

Thank you for the quick reply.

I have imported the information from the Excel file into the XML editor, and now I have a file like the following, with several hundred <doc> sections.

<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <doc>
    <Lines>4</Lines>
    <Words>30</Words>
    <Characters>192</Characters>
    <Filename>3-11-no-nukes.blogspot.jp-2012-04-el-gobierno-les-abandono-los-ninos.txt</Filename>
    <Blog_URL>3-11-no-nukes.blogspot.jp</Blog_URL>
    <Year>2012</Year>
    <Month>4</Month>
    <Day>0</Day>
    <Date>2012/4/0</Date>
    <Post_title_in_URL>el gobierno les abandono los ninos</Post_title_in_URL>
    <Blog_title>¿Energía nuclear? No Gracias.</Blog_title>
    <Blog_topic>Fukushima</Blog_topic>
    <Author_ID>1</Author_ID>
    <Nickname>Amor y Paz</Nickname>
    <Gender>Female</Gender>
    <City>0</City>
    <Country>Spain</Country>
    <Recent_activity>no</Recent_activity>
    <User_profile_text>Soy Japonesa. Vivo en España. ¿Quieres energía nuclear? No, gracias.</User_profile_text>
    <User_profile_URL>https://www.blogger.com/profile/00473445104589256113</User_profile_URL>
  </doc>
  <!-- ... several hundred more <doc> elements ... -->
</corpus>

What I would like now is for every <doc> section to have two subsections:

<header> which contains the 20 fields I have imported from excel.
<text> which contains the text extracted from each file (the name of the file is in the "Filename" field).

Do you know how I can do this?

Pilar

Re: Import TEI header information from excel and combine it

Posted: Thu Jan 29, 2015 11:51 am
by adrian
Hi,

As a starting point I used the copy stylesheet from the Oxygen samples: samples/xhtml/copy.xsl
I then added a template for the "doc" element that handles it as you described:

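<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Match document: start processing everything in "copy" mode -->
  <xsl:template match="/">
    <xsl:apply-templates mode="copy" select="."/>
  </xsl:template>

  <!-- Deep copy template: reproduces elements, attributes and text unchanged -->
  <xsl:template match="*|text()|@*" mode="copy">
    <xsl:copy>
      <xsl:apply-templates mode="copy" select="@*"/>
      <xsl:apply-templates mode="copy"/>
    </xsl:copy>
  </xsl:template>

  <!-- Handle default matching: output nothing outside "copy" mode -->
  <xsl:template match="*"/>

  <!-- Wrap the imported fields in <header> and append the file content as <text> -->
  <xsl:template match="doc" mode="copy">
    <doc>
      <header>
        <xsl:apply-templates mode="copy"/>
      </header>
      <!-- A relative Filename is resolved against the location of this stylesheet,
           so keep the .txt files in the same folder as the XSL file.
           If a file cannot be read, the <text> element is simply omitted. -->
      <xsl:if test="unparsed-text-available(Filename)">
        <text>
          <xsl:value-of select="unparsed-text(Filename)"/>
        </text>
      </xsl:if>
    </doc>
  </xsl:template>
</xsl:stylesheet>

To apply this in Oxygen: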
1. First create the XSL file with the contents I provided above.
2. Open/select the XML file and, from the main menu, invoke Document > Transformation > Configure Transformation Scenario (there's a corresponding action in the toolbar).
3. Press 'New' to create a new scenario.
4. Give the scenario an appropriate name.
5. Leave the 'XML URL' field at its default (${currentFileURL}).
6. In the 'XSL URL' field, pick the XSL file.
7. From the 'Transformer' combo box, choose Saxon-PE (an XSLT 2.0 processor is needed for unparsed-text()).
8. You can further tune the 'Output' and use editor variables to specify the path and name.
e.g., in the 'Save as' field you can specify: ${cfd}/${cfn}-out.xml which translates into <current-file-directory>/<current-filename>-out.xml
9. With the XML file selected, run the transformation: Document > Transformation > Apply Transformation Scenario (there's a corresponding action in the toolbar).
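
If the goal is one TEI document per text file rather than a single file with generic <header> and <text> wrappers, the "doc" template could be adapted along the following lines. This is only a sketch under assumptions, not a tested TEI Lite mapping: the choice of Post_title_in_URL as title, Nickname as author and Blog_title/Date as source description is a guess at a sensible mapping, the publicationStmt wording is a placeholder, the foo.txt to foo.xml naming rule is hypothetical, and each non-empty line of the text file is assumed to be one paragraph.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- Process each <doc>; all real output goes to the per-file result documents -->
  <xsl:template match="/corpus">
    <xsl:apply-templates select="doc"/>
  </xsl:template>

  <xsl:template match="doc">
    <!-- Hypothetical naming rule: foo.txt becomes foo.xml in the output folder -->
    <xsl:result-document href="{replace(Filename, '\.txt$', '.xml')}">
      <TEI>
        <teiHeader>
          <fileDesc>
            <titleStmt>
              <!-- Assumed mapping of imported fields to TEI elements -->
              <title><xsl:value-of select="Post_title_in_URL"/></title>
              <author><xsl:value-of select="Nickname"/></author>
            </titleStmt>
            <publicationStmt>
              <!-- Placeholder wording -->
              <p>Unpublished corpus file.</p>
            </publicationStmt>
            <sourceDesc>
              <bibl>
                <title><xsl:value-of select="Blog_title"/></title>
                <date><xsl:value-of select="Date"/></date>
              </bibl>
            </sourceDesc>
          </fileDesc>
        </teiHeader>
        <text>
          <body>
            <xsl:if test="unparsed-text-available(Filename)">
              <!-- One <p> per non-empty line of the text file -->
              <xsl:for-each select="tokenize(unparsed-text(Filename), '\r?\n')[normalize-space()]">
                <p><xsl:value-of select="."/></p>
              </xsl:for-each>
            </xsl:if>
          </body>
        </text>
      </TEI>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

The result documents are written relative to the location given in the scenario's 'Save as' field. If a text file cannot be read, the <body> is left empty, so the output should still be validated against the TEI schema afterwards.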

Regards,
Adrian

Re: Import TEI header information from excel and combine it

Posted: Thu Jan 29, 2015 1:20 pm
by pilar
Great, it works perfectly. Thank you so much for your explanations and quick reply!

Pilar