Scripting to semi-automate entity generation?

Posted: Thu Jul 23, 2015 4:56 am
by jreifste
Hello,

I apologize in advance if I do not use the correct language for my problem. I am very new to XML and this is my first project.

I'm currently using XML and the TEI P5 guidelines to encode a number of public domain theatrical plays for an academic project. I must generate entities for every speaker and line, as well as for the physical position of each line as it appears in a physical publication of these plays. The play I am currently working on is over 5,000 lines long, and typing in the entities by hand for each line, scene, speaker, etc. has been exhausting and very inefficient.

I'm curious if there is a way I can automate this process. I'd like to automate the following:

-Insert <l>.........</l> for every line in a document (I paste all of the text from the original document into Oxygen and have approx. 5,000 lines of text)
-Within the text, if a word is capitalized without a period immediately before it, place that word on the next line.
-after X number of lines, insert <page number and page image>

I would appreciate any help the community could provide and am happy to provide additional information if necessary. I have limited scripting and programming experience, but if I'm pointed in the right direction I believe I can figure it out.

Thank you

Re: Scripting to semi-automate entity generation?

Posted: Mon Jul 27, 2015 3:38 pm
by adrian
Hello,

If the source is text (as opposed to XML), you can't really automate the process, but you can make use of various Oxygen helpers to improve productivity.
jreifste wrote:I'm currently using XML and the TEI P5 guidelines to encode a number of public domain theatrical plays for an academic project. I must generate entities for every speaker, line, as well as physical position of each line as it corresponds to a physical publication of these plays. The current play that I am working on is over 5,000 lines long and physically typing in the entities for each line, scene, speaker, etc. has been exhausting and very inefficient.
I believe that by "entities" you mean XML tags: the start tag (<tag>) and end tag (</tag>) of an XML element. Note that in XML the term "entities" usually refers to XML entities (for example, &amp;), which are something else.

If you have certain XML structures that appear repeatedly, you can create code templates in Oxygen that insert (or surround existing text content with) that XML snippet.

Most of what you mentioned can be done with the Find/Replace tool (Find > Find/Replace) and regular expressions, but unfortunately Oxygen can't automate this. Still, you can accomplish all of it in a few manual steps and then make corrections afterwards.
jreifste wrote:-Insert <l>.........</l> for every line in a document (I paste all of the text from the original document into Oxygen and have approx. 5,000 lines of text)
You could wrap every line within <l> tags with the Find/Replace tool (Find > Find/Replace) and regular expressions:
Find: ^.+$ (using .+ rather than .* so that empty lines are not wrapped)
Replace with: <l>$0</l>
Options: Regular expression
Note that this doesn't check if lines are already wrapped in <l> tags, so only do this once.
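If you would rather do this step outside Oxygen, here is a minimal Python sketch of the same idea. It assumes the pasted play is plain text; the function name wrap_lines is only illustrative. Unlike the Find/Replace approach, it skips blank lines and lines that already start with <l>, and it escapes characters such as & so the result stays well-formed.

```python
from xml.sax.saxutils import escape


def wrap_lines(text: str) -> str:
    """Wrap each non-blank line in <l>...</l>.

    Assumption: the input is plain text, so &, < and > are escaped.
    Lines that already start with <l> are left untouched.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("<l>"):
            # Keep blank lines and already wrapped lines as they are.
            out.append(line)
        else:
            out.append(f"<l>{escape(stripped)}</l>")
    return "\n".join(out)


if __name__ == "__main__":
    print(wrap_lines("To be, or not to be\n\nthat is the question"))
```

Because already wrapped lines are skipped, running it twice on the same text should be harmless.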
jreifste wrote:-Within the text, if a word is capitalized without a period immediately before it, place that word on the next line.
I have a similar solution with the Find/Replace tool (Find > Find/Replace) and regular expressions:
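Find: (?<![.\s])[ \t]+(?=[A-Z])
Replace with: \n
Options:
Case sensitive
Regular expression
Note that this replaces the whitespace before a capitalized word with a line break, unless the character before that whitespace is a period. It also matches words such as "I" and proper names, so review the results. If you also want it to break an existing <l> tag, you can instead use Replace with: </l>\n<l>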
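The same kind of replacement can be sketched as a small Python script. The pattern below is one possible reading of the rule and is my assumption, not something Oxygen provides: it turns the whitespace before a capitalized word into a line break unless the character before that whitespace is a period. The names BREAK_BEFORE_CAPITAL and break_before_capitals are only illustrative.

```python
import re

# Whitespace that is followed by a capital letter and is NOT preceded
# by a period (or by other whitespace, so indentation is left alone).
BREAK_BEFORE_CAPITAL = re.compile(r"(?<![.\s])[ \t]+(?=[A-Z])")


def break_before_capitals(text: str) -> str:
    """Move capitalized words that do not follow a period to a new line."""
    return BREAK_BEFORE_CAPITAL.sub("\n", text)


if __name__ == "__main__":
    print(break_before_capitals("my lord. Then go HAMLET and speak"))
```

Like the Find/Replace version, this will also break before "I" and proper names in the middle of a sentence, so the output still needs a manual pass.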
jreifste wrote:-after X number of lines, insert <page number and page image>
This assumes the lines are already wrapped in <l> tags:
Find: (<l>.*?</l>\n){X} (replace X with the actual number)
Replace with: $0<pb n="1" facs="page1.png"/>\n
Options: Regular expression
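Note that if you use the Find and Replace buttons, this counts X lines from your current position in the document, so make sure you start at the top. Alternatively, use 'Find All' and/or 'Replace All', which always start at the top. The replacement text is literal, so every inserted <pb> gets the same n and facs values; you will need to update them by hand afterwards.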
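If you want the page number to increase automatically, a script outside Oxygen can do the counting. This is a minimal Python sketch, assuming every verse line sits on its own text line and starts with <l>. The function name insert_page_breaks, the parameters, and the page{n}.png naming scheme are hypothetical, so adjust them to match your image files.

```python
def insert_page_breaks(text: str, lines_per_page: int,
                       first_page: int = 1,
                       facs_template: str = "page{n}.png") -> str:
    """Insert a <pb/> element after every `lines_per_page` <l> lines.

    Assumptions: one <l>...</l> per text line, and page images named
    after the (hypothetical) pattern in `facs_template`.
    """
    out = []
    count = 0
    page = first_page
    for line in text.splitlines():
        out.append(line)
        if line.strip().startswith("<l>"):
            count += 1
            if count % lines_per_page == 0:
                facs = facs_template.format(n=page)
                out.append(f'<pb n="{page}" facs="{facs}"/>')
                page += 1
    return "\n".join(out)


if __name__ == "__main__":
    sample = "<l>a</l>\n<l>b</l>\n<l>c</l>\n<l>d</l>"
    print(insert_page_breaks(sample, 2))
```

This only works if each printed page really holds the same number of lines; where it does not, you would still need to move the <pb> elements by hand.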

Regards,
Adrian