Position of illegal characters not reported

csalsa
Posts: 97
Joined: Tue Apr 22, 2008 9:31 am

Position of illegal characters not reported

Post by csalsa »

Hi

Today I tried to load a large XML data file (about 2.5 MB). It should have contained only ASCII characters, but it also contained 36 non-ASCII characters encoded as UTF-8. When I set the encoding in the XML declaration to "ascii", I got an error stating that there are illegal characters in the file, but the error did not list the positions of these characters.

In the end, I used Windows Notepad to save a copy of the file as "ASCII" and then used WinMerge to locate the differences.

I would have liked Oxygen XML Editor to tell me where the illegal characters were located.
sorin_ristache
Posts: 4141
Joined: Fri Mar 28, 2003 2:12 pm

Re: Position of illegal characters not reported

Post by sorin_ristache »

Hello,

We will look into finding the first position at which the byte sequence cannot be decoded using the charset declared at the beginning of the XML file, and displaying that position in the error message. By default, the Java encoding classes do not report this position.


Thank you for your suggestion,
Sorin
Radu
Posts: 9051
Joined: Fri Jul 09, 2004 5:18 pm

Re: Position of illegal characters not reported

Post by Radu »

Hi,

Just to make sure we understand your use-case:
You open a file in Oxygen (encoded in UTF-8 for example), you change the encoding in the XML header to "ascii" and then save.
When saving, the error message does not offer any indication of the line/column where the "offending" character is.

Is this correct?

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
csalsa
Posts: 97
Joined: Tue Apr 22, 2008 9:31 am

Re: Position of illegal characters not reported

Post by csalsa »

Hi

I looked at this again and now have a better understanding. I can replicate the problem with the sample:

Code:

<?xml version="1.0" encoding="utf-8"?>
<message>
<test>abc</test>
</message>
Use a binary editor to change one byte of the text of <test> to 0xFF. For example, change the 'b' in "abc" to 0xFF. The byte 0xFF is never valid in the UTF-8 character encoding, so this causes the error.

I understand that the error message comes from the Java framework, java.nio.charset.MalformedInputException, and that the exception does not provide position details of the illegal character.

I suggest that, if this exception occurs when Oxygen XML loads a document, the document be reloaded as a byte stream and the stream scanned to determine where the illegal characters are located. The positions can then be reported to the user, though they might need a binary editor to fix the file.
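To make the suggestion concrete, here is a minimal sketch of such a scan in Java. It is only an illustration, not how Oxygen works: the class and method names (IllegalByteScanner, findUndecodableOffsets, lineNumberAt) are made up for this example. It relies on the standard java.nio.charset.CharsetDecoder, which, when asked to REPORT errors, leaves the input buffer positioned at the start of the bad byte sequence. The line-number helper assumes an ASCII-compatible encoding in which a line break contains the byte 0x0A.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class IllegalByteScanner {

    /** Returns the 0-based byte offsets at which the data cannot be decoded with the charset. */
    public static List<Integer> findUndecodableOffsets(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096); // decoded text is discarded
        List<Integer> offsets = new ArrayList<>();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The bad sequence starts at the current input position.
                offsets.add(in.position());
                // Skip the bad bytes and continue scanning.
                in.position(in.position() + result.length());
            } else if (result.isOverflow()) {
                out.clear(); // make room for more output
            } else {
                break; // underflow: all input has been consumed
            }
        }
        return offsets;
    }

    /** Returns the 1-based line number of a byte offset, counting 0x0A bytes before it. */
    public static int lineNumberAt(byte[] data, int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < data.length; i++) {
            if (data[i] == 0x0A) {
                line++;
            }
        }
        return line;
    }
}
```

For the sample above, calling findUndecodableOffsets with the file's bytes and StandardCharsets.UTF_8 should return a single offset, the position of the 0xFF byte, and lineNumberAt turns that offset into a line number. Note that when a UTF-8 file is scanned as US-ASCII, each byte of a multi-byte character is reported separately.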

My original problem was that the file was 2.5 MB and I had no idea where in the file the problem characters were.
Radu
Posts: 9051
Joined: Fri Jul 09, 2004 5:18 pm

Re: Position of illegal characters not reported

Post by Radu »

Hi,

There are two situations in which problems like this can occur.
1) Loading an XML file which specifies an encoding and contains characters which cannot be loaded using that encoding.
2) Adding illegal characters to an already loaded file and then trying to save it.

We already improved the second case so that it also shows the line/column of the "offending" character. This fix should be available in Oxygen 10.3.
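For illustration only, locating the offending character in the second case can be sketched in Java roughly as follows. This is not Oxygen's actual code: the class and method names (UnencodableCharLocator, firstUnencodable) are hypothetical, and the sketch assumes the document text is available as a String with '\n' line breaks. It uses the standard CharsetEncoder.canEncode method to test each character against the target encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
public class UnencodableCharLocator {

    /**
     * Returns {line, column} (both 1-based) of the first character that the
     * charset cannot encode, or null if every character can be encoded.
     */
    public static int[] firstUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        int line = 1;
        int column = 1;
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int len = Character.charCount(codePoint); // 2 for surrogate pairs
            if (!encoder.canEncode(text.subSequence(i, i + len))) {
                return new int[] { line, column };
            }
            if (codePoint == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
            i += len;
        }
        return null;
    }
}
```

Here a column counts one per character (code point), so a character outside the Basic Multilingual Plane advances the column by one, not two.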

As I understand it now, you are interested in the first case.
We'll try to fix it in a future version.

Regards,
Radu
csalsa
Posts: 97
Joined: Tue Apr 22, 2008 9:31 am

Re: Position of illegal characters not reported

Post by csalsa »

Thanks Radu