webhelp outputs include content that should be filtered out

Posted: Wed Mar 02, 2016 5:54 pm
by damienr
Hi,

The WebHelp output includes content that should have been filtered out. The TOC contains no links to this content, but the search engine still finds and indexes the files, which makes them accessible to customers through search.

Content inclusion is conditional: distribution=Internal_OBSOLETE
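
For readers following along, a DITAVAL filter that excludes this value might look like the sketch below. It assumes that "distribution" is a profiling attribute specialized from props in the poster's DITA setup; the actual DITAVAL file is not shown in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "distribution" is assumed to be a custom profiling attribute -->
<val>
  <!-- Exclude every element profiled with distribution="Internal_OBSOLETE" -->
  <prop att="distribution" val="Internal_OBSOLETE" action="exclude"/>
</val>
```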

Is there a way to not generate output for those topics?

Thanks

Damien

Re: webhelp outputs include content that should be filtered out

Posted: Thu Mar 03, 2016 10:49 am
by bogdan_cercelaru
Hello,

If you are publishing WebHelp into a folder that contains other HTML files, all of those files (that is, files with the same extension as the one set using the "args.outext" parameter) will be indexed and presented in the search results. The indexer runs on all HTML files in the output folder, regardless of whether a topic is linked in the map.

To clean the output directory before WebHelp is generated, you can use the "clean.output" parameter and set it to "yes".
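
In Oxygen, this parameter is normally set in the Parameters tab of the transformation scenario. As a rough sketch for an Ant-driven build, it could be passed as a property like any other DITA-OT parameter. The project name, target name, and paths below are hypothetical placeholders, and the "webhelp" transtype assumes the Oxygen WebHelp plugin is installed in the DITA-OT.

```xml
<project name="publish-webhelp" default="webhelp" basedir=".">
  <!-- Hypothetical location of the DITA-OT installation -->
  <property name="dita.dir" location="/path/to/dita-ot"/>

  <target name="webhelp">
    <ant antfile="${dita.dir}/build.xml">
      <!-- Hypothetical input map and output folder -->
      <property name="args.input" location="maps/book.ditamap"/>
      <property name="output.dir" location="out/webhelp"/>
      <property name="transtype" value="webhelp"/>
      <!-- Parameter named in the reply above: empty the output folder first -->
      <property name="clean.output" value="yes"/>
    </ant>
  </target>
</project>
```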

Regards,
Bogdan

Re: webhelp outputs include content that should be filtered out

Posted: Thu Mar 03, 2016 3:59 pm
by damienr
This is not the problem.
The problem is that some DITA content that should not be part of the output (because it is excluded by the value set in the DITAVAL file) is present in the generated WebHelp. It is then indexed and appears in the search results.

Damien

Re: webhelp outputs include content that should be filtered out

Posted: Fri Mar 04, 2016 1:34 pm
by Radu
Hi Damien,

If you remove the output and temporary files folders manually before publishing, do you still get those extra HTML files for the topics that should be excluded?
Maybe you excluded the topics in the DITA Map, but other topics in the publication still link to them. In that case you should also profile those links.
If you run the "Validate and Check for Completeness" action from the DITA Maps Manager, you can run it with a DITAVAL filter configured. It might report that certain topics are not referenced in the DITA Map but are linked from other topics that are present in the DITA Map.
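
To illustrate what profiling the links means, here is a sketch with two fragments, one from a map and one from a topic. The file name obsolete_topic.dita is hypothetical, and "distribution" is assumed to be a profiling attribute specialized from props, as in the original question.

```xml
<!-- Fragment from the DITA Map: the obsolete topic is profiled -->
<topicref href="obsolete_topic.dita" distribution="Internal_OBSOLETE"/>

<!-- Fragment from another topic: the paragraph that links to the obsolete
     topic carries the same profiling value, so the link is filtered out
     together with its target -->
<p distribution="Internal_OBSOLETE">See <xref href="obsolete_topic.dita"/> for details.</p>
```

If only the topicref is profiled, the unprofiled link in the second fragment can still pull the target topic into the output.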

Regards,
Radu

Re: webhelp outputs include content that should be filtered out

Posted: Fri Mar 04, 2016 3:10 pm
by Pascale
Hi Radu,

I can confirm that
- the book is valid when using the Oxygen check completeness and the DITAVAL
- the profiled content is excluded from the PDF
- the profiled content is excluded from the TOC and next/previous page navigation
So the problem is that, although the topic should be excluded, it is still converted to HTML and then indexed.

Pascale

Re: webhelp outputs include content that should be filtered out

Posted: Fri Mar 04, 2016 3:16 pm
by Radu
Hi Pascale,

Do you have the same problem when converting to XHTML?
You could try an experiment: take one of the topics that the DITAVAL filter should make unreachable and make it not well-formed (for example, remove the root start tag).
Then configure Oxygen in the Preferences->DITA page to always show the console output and publish to WebHelp. After that, you can look in the DITA OT console view to see at what stage of the processing the DITA OT tries to read the topic. It might give us some indication about the problem.

Also, if you can put together a sample DITA project to reproduce the problem and attach it to an email, we could try to investigate this on our side.

Regards,
Radu

Re: webhelp outputs include content that should be filtered out

Posted: Fri Mar 04, 2016 6:58 pm
by Pascale
Hi Radu,

We found the cause of the problem: we are using DITA OT 2.2.2 (not 1.8), and this version seems to have severe problems with the XML Catalogs.

On many occasions, and in particular when executing the [gen-list] and [debug-filter] targets, the OT issues messages like:

Failed to read DITAVAL file: ...\ditaval.dtd (The system cannot find the file specified)
[DOTJ037W][WARN] The XML schema and DTD validation function of the parser is turned off. Please make sure ....
Using Xerces grammar pool for DTD and schema caching.


Our workaround is to copy the DTD to where it is needed, by adding the following to the plugin (in its plugin.xml):

<feature extension="depend.preprocess.pre" value="copy-ditaval-dtd"/>

and by defining a new Ant target in the buildxxx.xml file:

<target name="copy-ditaval-dtd" description="Copy DITAVAL DTD">
  <!-- Resolve the folder that contains the DITAVAL file passed to the build -->
  <dirname property="ditaval.dir" file="${dita.input.valfile}" />
  <!-- Copy the DITAVAL DTD shipped with the DITA 1.2 plugin into that folder -->
  <copy file="${basedir}/plugins/org.oasis-open.dita.v1_2/dtd/ditaval/dtd/ditaval.dtd" todir="${ditaval.dir}" />
</target>

With that, the filtering occurs as expected: the excluded content is no longer present, and the search behaves correctly.

Kind regards,
Pascale

Re: webhelp outputs include content that should be filtered out

Posted: Mon Mar 07, 2016 9:17 am
by Radu
Hi Pascale,

It's good that you found the problem. I'm assuming the referenced DITAVAL DTD is in the proper location (relative to the DITAVAL file) somewhere in the sources folder?
If so, the DITA OT probably does not copy it to the temporary files folder, and that is where the parsing problem comes from. Whenever an XML document with an associated DTD is parsed, that DTD needs to be found and resolved; otherwise the parsing fails.
As a workaround, I guess you could remove the DOCTYPE declaration from the DITAVAL file entirely. Oxygen will still validate the DITAVAL when it's opened.
If you want, you can also register a bug on the DITA OT issues list.

Regards,
Radu