Import TEI header information from excel and combine it with
-
- Posts: 3
- Joined: Tue Jan 27, 2015 9:11 am
Import TEI header information from excel and combine it with
Hello, I am new to Oxygen and it would be great if you could help me with a corpus I am working on.
I have several hundred plain-text files. They are short texts (around 200 words each) that make up a corpus of around 700,000 words.
I would like to create an XML version of every file (using a "TEI Lite" schema) containing information such as the title of the text, the author, the date, and so on.
I already have an Excel file with the information associated with every text file (one row per text file; the columns contain information such as title, author, date, etc.), like this:
filename    title    author    date
filename1   title1   author1   2013
filename2   title2   author2   2010
filename3   title3   author3   2011
...
However, the Excel file only contains the metadata associated with each text file (and the name of the text file), not the text itself.
I would like to convert the information in the Excel file into a TEI header and combine it with the text file, so that I have an XML document with two parts: TEI header and text.
Could you please let me know how I can do that?
I think I can export the Excel information to XML, as explained here: https://www.udemy.com/blog/excel-to-xml/ but I am not sure how to include the text itself in the XML file.
Any help would be very much appreciated.
Pilar
-
- Posts: 2879
- Joined: Tue May 17, 2005 4:01 pm
Re: Import TEI header information from excel and combine it
Hello,
1. First you must either:
- From Oxygen, import the Excel file into XML (File > Import > MS Excel File) so you can process it with a stylesheet.
or
- From Excel, export it to XML (as explained in the link you found).
2. Then you must write a stylesheet that processes that XML file, looks up each text filename, and reads the content of each text file as it is (with unparsed-text()).
Once you have decided on the structure of the imported XML file, please post a snippet of that XML, since the stylesheet must be tailored to that structure.
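As a rough illustration of step 2, the lookup could be sketched as below. The element names table, row and filename are placeholders, since the structure of the imported XML is not known at this point. The sketch also assumes that the text files are UTF-8 and sit in the same folder as the XML file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: <table><row><filename>a.txt</filename>...</row>...</table> -->
  <xsl:template match="/table">
    <texts>
      <xsl:apply-templates select="row"/>
    </texts>
  </xsl:template>

  <xsl:template match="row">
    <!-- Assumption: resolve the file name against the location of the XML file -->
    <xsl:variable name="uri" select="resolve-uri(filename, base-uri(.))"/>
    <item name="{filename}">
      <xsl:choose>
        <!-- Only read the file if it exists and can be decoded as UTF-8 -->
        <xsl:when test="unparsed-text-available($uri, 'UTF-8')">
          <xsl:value-of select="unparsed-text($uri, 'UTF-8')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:comment>file not found or not readable</xsl:comment>
        </xsl:otherwise>
      </xsl:choose>
    </item>
  </xsl:template>
</xsl:stylesheet>
```

File names containing spaces or other characters that are not allowed in a URI would need escaping before they are passed to resolve-uri(); the sketch does not handle that case.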
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
-
- Posts: 3
- Joined: Tue Jan 27, 2015 9:11 am
Re: Import TEI header information from excel and combine it
Hello again,
Thank you for the quick reply.
I have imported the information from the Excel file into the XML editor, and now I have a file like the following, with several hundred <doc> sections.
Code:
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <doc>
    <Lines>4</Lines>
    <Words>30</Words>
    <Characters>192</Characters>
    <Filename>3-11-no-nukes.blogspot.jp-2012-04-el-gobierno-les-abandono-los-ninos.txt</Filename>
    <Blog_URL>3-11-no-nukes.blogspot.jp</Blog_URL>
    <Year>2012</Year>
    <Month>4</Month>
    <Day>0</Day>
    <Date>2012/4/0</Date>
    <Post_title_in_URL>el gobierno les abandono los ninos</Post_title_in_URL>
    <Blog_title>¿Energía nuclear? No Gracias.</Blog_title>
    <Blog_topic>Fukushima</Blog_topic>
    <Author_ID>1</Author_ID>
    <Nickname>Amor y Paz</Nickname>
    <Gender>Female</Gender>
    <City>0</City>
    <Country>Spain</Country>
    <Recent_activity>no</Recent_activity>
    <User_profile_text>Soy Japonesa. Vivo en España. ¿Quieres energía nuclear? No, gracias.</User_profile_text>
    <User_profile_URL>https://www.blogger.com/profile/00473445104589256113</User_profile_URL>
  </doc>
  <!-- further <doc> elements follow -->
</corpus>
What I would like to do now is give every <doc> section two subsections:
<header>, which contains the 20 fields I have imported from Excel.
<text>, which contains the text extracted from each file (the name of the file is in the "Filename" field).
Do you know how I can do this?
Pilar
-
- Posts: 2879
- Joined: Tue May 17, 2005 4:01 pm
Re: Import TEI header information from excel and combine it
Hi,
I used the copy stylesheet from the Oxygen samples (samples/xhtml/copy.xsl) as the starting point.
I then added a template for the "doc" element that handles it as you described:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Match document -->
  <xsl:template match="/">
    <xsl:apply-templates mode="copy" select="."/>
  </xsl:template>

  <!-- Deep copy template -->
  <xsl:template match="*|text()|@*" mode="copy">
    <xsl:copy>
      <xsl:apply-templates mode="copy" select="@*"/>
      <xsl:apply-templates mode="copy"/>
    </xsl:copy>
  </xsl:template>

  <!-- Handle default matching -->
  <xsl:template match="*"/>

  <!-- Wrap the imported fields in <header> and add the file content as <text> -->
  <xsl:template match="doc" mode="copy">
    <doc>
      <header>
        <xsl:apply-templates mode="copy"/>
      </header>
      <!-- A relative Filename is resolved against the location of this stylesheet;
           <text> is omitted when the file cannot be read -->
      <xsl:if test="unparsed-text-available(Filename)">
        <text>
          <xsl:value-of select="unparsed-text(Filename)"/>
        </text>
      </xsl:if>
    </doc>
  </xsl:template>
</xsl:stylesheet>
To apply this in Oxygen:
1. First create the XSL file with the contents I provided above.
2. Open or select the XML file and, from the main menu, invoke Document > Transformation > Configure Transformation Scenario (there is a corresponding action in the toolbar).
3. Press 'New' to create a new scenario.
4. Give the scenario an appropriate name.
5. Leave the 'XML URL' field at its default (${currentFileURL}).
6. In the 'XSL URL' field, pick the XSL file.
7. From the Transformer combo box, choose Saxon-PE (needed for XSLT 2.0).
8. You can further tune the 'Output' and use editor variables to specify the path and name, e.g. in the 'Save as' field you can specify ${cfd}/${cfn}-out.xml, which translates into <current-file-directory>/<current-filename>-out.xml.
9. With the XML file selected, run the transformation: Document > Transformation > Apply Transformation Scenario (there is a corresponding action in the toolbar).
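Since the original goal was one TEI document per text file, a possible variant is sketched below. It is only an illustration and rests on several assumptions: the text files are UTF-8 and sit next to the XML file, the results go into a "tei" subfolder, and the mapping of Post_title_in_URL to <title> and Nickname to <author> is an arbitrary choice, as is the placeholder wording in <publicationStmt>. It uses xsl:result-document to write one file per <doc>, so the scenario's main output stays essentially empty.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- Process every <doc>; nothing is written to the main output -->
  <xsl:template match="/corpus">
    <xsl:apply-templates select="doc"/>
  </xsl:template>

  <xsl:template match="doc">
    <!-- Assumption: the .txt files are in the same folder as the XML file -->
    <xsl:variable name="src" select="resolve-uri(Filename, base-uri(.))"/>
    <!-- Assumption: output goes to tei/<same name>.xml next to the XML file -->
    <xsl:variable name="out"
        select="resolve-uri(concat('tei/', replace(Filename, '\.txt$', '.xml')), base-uri(.))"/>
    <xsl:result-document href="{$out}">
      <TEI>
        <teiHeader>
          <fileDesc>
            <titleStmt>
              <title><xsl:value-of select="Post_title_in_URL"/></title>
              <author><xsl:value-of select="Nickname"/></author>
            </titleStmt>
            <publicationStmt>
              <!-- Placeholder wording, to be replaced -->
              <p>Unpublished corpus file.</p>
            </publicationStmt>
            <sourceDesc>
              <p>Blog post from <xsl:value-of select="Blog_URL"/>, dated <xsl:value-of select="Date"/>.</p>
            </sourceDesc>
          </fileDesc>
        </teiHeader>
        <text>
          <body>
            <xsl:choose>
              <xsl:when test="unparsed-text-available($src, 'UTF-8')">
                <!-- One <p> per non-empty line of the text file -->
                <xsl:for-each select="tokenize(unparsed-text($src, 'UTF-8'), '\r?\n')[normalize-space()]">
                  <p><xsl:value-of select="."/></p>
                </xsl:for-each>
              </xsl:when>
              <xsl:otherwise>
                <p><xsl:comment>text file not found</xsl:comment></p>
              </xsl:otherwise>
            </xsl:choose>
          </body>
        </text>
      </TEI>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

The remaining fields (Gender, Country, Blog_topic and so on) could be mapped to other header elements in the same way; which TEI elements suit them best is a modelling decision left open here.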
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com