Sunday 28 August 2016

Planning CSV extract – Delimiters and the lies they hide



Well, I was working on a research project to see if I can design a CSV extract parser for the Planning metadata extract and since Java is my poison, I was using it to see if I could design a slightly UI based metadata parser… 

But, it was slightly more fun than that…

Now, when we say that the Planning metadata extract is a CSV file this is what we all assume it to be…A file that is delimited by commas for fields and although it may span across multiple line, it will still be one line that has been text wrapped…

Now this is the file that I was using for testing the metadata parser…This was a good example…Perfect example of a good test case because hardly 20 lines exposed a new flaw… 


I was using the Java function split() to break a line into different tokens and then mapping it to the first line which is a one-to-one mapping of the fields…

However, when my program ran on line 12, I got an ArrayIndexOutOfBoundsException… This happens when you try to read an element with an index whose value is greater than the last element index… i.e if the array has size 10, if I try to read the 20th element in the array I will get an ArrayIndexOutOfBoundsException…

Now, let us look at the offending line…



The metadata spans across multiple lines if you have an “if” formula tagged to a member in the CSV extract…The main assumption about the extract goes to toss that a single line represents all information of a member… New solution needed alert…

So, now I had two questions regarding the extract…

  • How does Planning read the extract?
  • How does it know if it has met a line delimiter or a field delimiter?

Now, since I am using Notepad++, I tried to see what the delimiters are… Below is the same file with the delimiters being shown…








Observe the delimiters in the file for line 11 to 15… Line number 11 has a line feed as a delimiter and line number 15 has a line CtrlR+Line Feed as a delimiter…

Mystery solved… Planning CSV parser uses a Ctrl R + Line Feed to indicate the end of record… A simple line feed indicates that more of the records is present on the next line…

This is a problem with different delimiters being used for different environment…In some systems line feed may be a single character whereas in others (Windows), line feed is made up of two characters…

(Interesting fact that explain why we need two characters for newline traces back to printers and type writers… Talk about old school… In a typewriter, a new line consists of motion across two axes…Down and to the left…That is why you need two characters for newline in some system…However, for a software system we can combine these two into just one special character)


Now, I was using Java and thought that I will read a line using the readLine() function, but catch with this is that if you use readLine(), Java will trim the end-of-line delimiter… This means that the Planning CSV parser reads the extract at a character level and parses the extract when it meets the end-of-line character…

Now, the best part, this is what the extract looks like in my parser after parsing…




I am still working on the CSV parser…Need to test it for all ifs and buts and improve the UI a bit…



No comments:

Post a Comment