Content index to document index

sijomon
Posts: 83
Joined: Wed May 20, 2009 1:18 pm

Post by sijomon » Mon Nov 16, 2009 3:14 pm

Hi,

I'm pretty sure I've seen this issue covered, at least partially, in another post I read some time ago, but I can't find it, so sorry if I'm duplicating things here.

I am building an operation which shows the user a list of all tags of a certain type within the current document, and allows them to highlight the tags by clicking in the list. The list should also contain sections of text which look like they should be marked up with the tag. These sections of text are specified by a regular expression. I am currently extracting the content of the document as follows:

Code:

authorAccess.getDocumentController().getText(0, authorAccess.getDocumentController().getTextContentLength());

I can then run the regex over this and get a number of matches, with their start and end indexes into the content string.
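
A minimal sketch of this regex step, assuming the document content is already available as a plain Java String. The class name ContentMatches, the method findSpans, and the sample text and URL pattern are illustrative only and are not part of the Author API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentMatches {

    /**
     * Collects {start, end} index pairs (end exclusive) for every regex
     * match found in the content string.
     */
    public static List<int[]> findSpans(String content, String regex) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        while (matcher.find()) {
            spans.add(new int[] {matcher.start(), matcher.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        // Illustrative content and URL pattern, not taken from a real document.
        String content = "See http://a.org and http://b.org/x for details.";
        for (int[] span : findSpans(content, "https?://\\S+")) {
            System.out.println(span[0] + "-" + span[1] + ": "
                    + content.substring(span[0], span[1]));
        }
    }
}
```

These indexes are relative to the extracted string only, which is why they still need to be mapped to offsets in the author document.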

What I can't do is convert this index into an index into the author document. As far as I recall, the index into the author document is essentially the same as that into the content, with +1 for each tag. I really don't want to have to construct this index by parsing the document node tree; is there an automated way to convert from a content index to an author document index?

Thanks,

Simon.

sorin_ristache
Posts: 4144
Joined: Fri Mar 28, 2003 2:12 pm

Re: Content index to document index

Post by sorin_ristache » Mon Nov 16, 2009 4:40 pm

Hello,

As you can read in the Javadoc of the Author API, the method AuthorDocumentController.getTextContentLength() is deprecated. Instead, find the list of elements that have the same tag name, for example TAG, with AuthorDocumentController.findNodesByXPath("//TAG", true, true, true), which returns an AuthorNode[] array. You can then get the start index and end index of every AuthorNode using AuthorNode.getStartOffset() and AuthorNode.getEndOffset().


Regards,
Sorin

sijomon
Posts: 83
Joined: Wed May 20, 2009 1:18 pm

Re: Content index to document index

Post by sijomon » Mon Nov 16, 2009 5:18 pm

I want to search across all text within the document, and I don't know which nodes might contain the matches I'm interested in. For example, say I am searching for URLs: I want to find all text in the document that looks like a URL, regardless of where in the document that text occurs. My knowledge of XPath is pretty sketchy. Can I use XPath to identify nodes which contain text that matches a certain regex? If so, I can use the method you indicate; if not, do you have any other suggestions?

sijomon
Posts: 83
Joined: Wed May 20, 2009 1:18 pm

Re: Content index to document index

Post by sijomon » Mon Nov 16, 2009 5:49 pm

I think I can use XPath.

After a bit of research, it appears that the XPath expression:

Code:

//text()[matches(., "<REGEX>")]

will identify all text nodes which match the regex. I can then grab the offset of the text node's start using AuthorNode.getStartOffset(), run the text node's content through the same regex in Java to get the offset of the start of the match, and add this to the node offset to get the document offset.

I think that will work.
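
The arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not Oxygen code: TextNode is a hypothetical stand-in for a node returned by the XPath query, holding the value that AuthorNode.getStartOffset() would return plus the node's text, and it assumes that the start offset points at the first character of that text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeOffsetMapper {

    /** Hypothetical stand-in for a text node: its document start offset and its text. */
    public static final class TextNode {
        final int startOffset;
        final String text;

        public TextNode(int startOffset, String text) {
            this.startOffset = startOffset;
            this.text = text;
        }
    }

    /**
     * Returns {start, end} document offsets (end exclusive) for every
     * regex match inside every node.
     */
    public static List<int[]> documentOffsets(List<TextNode> nodes, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<int[]> result = new ArrayList<>();
        for (TextNode node : nodes) {
            Matcher matcher = pattern.matcher(node.text);
            while (matcher.find()) {
                // Offset of the match within the node, shifted by the
                // node's own offset in the document.
                result.add(new int[] {
                    node.startOffset + matcher.start(),
                    node.startOffset + matcher.end()
                });
            }
        }
        return result;
    }
}
```

If the real start offset turns out to point at a marker just before the text rather than at its first character, the shift would need adjusting accordingly.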

sorin_ristache
Posts: 4144
Joined: Fri Mar 28, 2003 2:12 pm

Re: Content index to document index

Post by sorin_ristache » Mon Nov 16, 2009 6:20 pm

I am not sure that will work, because matches() is an XSLT function, not an XPath function. I think you will have to go through all elements or all nodes with //* or //text() and check whether the content matches your regex.
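
If the check is done in Java as suggested here, one detail is worth keeping in mind: Matcher.matches() succeeds only when the entire string matches the pattern, whereas Matcher.find() succeeds when any part of it does. A small sketch of the difference follows; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class RegexCheck {

    /** True if any part of the text matches the regex (substring search). */
    public static boolean containsMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    /** True only if the entire text matches the regex. */
    public static boolean isFullMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }
}
```

For finding URL-like text inside a longer run of node content, the find() variant is the one that applies.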


Regards,
Sorin

sijomon
Posts: 83
Joined: Wed May 20, 2009 1:18 pm

Re: Content index to document index

Post by sijomon » Tue Nov 17, 2009 1:07 pm

Hi,

I believe matches() is part of XPath 2.0:

http://www.w3.org/TR/xpath-functions/#func-matches

Either way, it does work as an XPath expression in Oxygen, and I am now successfully identifying the start and end of the matches in the document.

Many Thanks,

Simon.
