UniData, like other weakly-typed systems, makes some programming tasks easier by not requiring the developer to declare and adhere to a data type for variables. The general pros and cons of this have been debated many times across many languages and hence will not be discussed here. What will be discussed are specific cases where this can cause you unexpected headaches.
A posting was made on the u2ug forums by Koglan Naicker about some unexpected issues when finding duplicates in a data-set.
In short, he found that when strings contained large numbers, UniData would sometimes incorrectly evaluate two different strings as equal. For example:
IF '360091600172131297' EQ '360091600172131299' THEN CRT "Equal"
The above code results in “Equal” being displayed on the screen. This is caused by a combination of two factors.
The first is that UniData is weakly typed. This means that it does not explicitly distinguish between strings and numbers, but attempts to determine the data type by examining the data. In this case, since the strings are numeric, it automatically treats them as numbers.
The second factor is that, now that it is treating those two strings as numbers, it needs to hold them in an appropriate data type on the CPU. Since the two values are too large to be treated as integers, they get converted to floating-point numbers. Due to the rounding that occurs, both of these strings end up converted to the same floating-point representation! A better method may have been to use something such as a bignum instead of converting to floating-point. There would be a speed trade-off, but surely that would have been better than potentially incorrect programs.
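The rounding can be reproduced outside UniData. The following Python sketch assumes the conversion behaves like an IEEE 754 double (an assumption; the exact internal representation UniData uses is not stated here). A double carries 53 bits of precision, so it cannot represent every integer of this size:

```python
# Assumption: the numeric conversion behaves like an IEEE 754 double.
a = "360091600172131297"
b = "360091600172131299"
c = "360091600172130474"

# As strings, all three values are distinct.
strings_differ = (a != b) and (a != c)

# As doubles, a and b collapse to the same value; at this magnitude
# adjacent doubles are 64 apart, so a difference of 2 can be lost.
same_as_float = float(a) == float(b)
differs_from_c = float(a) != float(c)

print(strings_differ, same_as_float, differs_from_c)
print(int(float(a)), int(float(b)))  # the value both strings round to
```

Note that the third value is far enough away to survive the conversion, which is why only some pairs of large numeric strings collide.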
Some people suggest prefixing or appending a non-numeric character to each string to force it to be treated as a string. This is not entirely elegant and can have performance implications. Fortunately, UniData does have proper functions to handle these situations. In the case where you will be comparing strings that may consist of only numeric characters, you should use the SCMP function. This function compares two strings as strings, regardless of the actual data in them. Using this when you need to be certain how the comparison is performed can save you a lot of headaches in the future.
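To illustrate the difference between the two kinds of comparison, here is a small Python sketch. Both function names are hypothetical: `loose_equal` is only a rough model of a weakly-typed comparison, not UniData's actual implementation, and `string_compare` mimics the idea of a strings-only comparison (returning -1, 0 or 1 is a convention chosen for this sketch).

```python
def loose_equal(x: str, y: str) -> bool:
    """Rough model of a weakly-typed equality test (hypothetical)."""
    try:
        # If both operands look numeric, compare them as floats.
        return float(x) == float(y)
    except ValueError:
        # Otherwise fall back to comparing the raw strings.
        return x == y


def string_compare(x: str, y: str) -> int:
    """Compare strictly as strings: -1, 0 or 1 (hypothetical helper)."""
    return (x > y) - (x < y)


a, b = "360091600172131297", "360091600172131299"
print(loose_equal(a, b))     # numeric coercion hides the difference
print(string_compare(a, b))  # string comparison sees it
```

The point is that the strings-only path never looks at whether the data happens to be numeric, so its result does not depend on the content.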
Also of interest is that this issue doesn’t just apply to UniBasic, but can also affect UniQuery!
It should be noted, though, that this only affects UniQuery when the dictionary item is right-aligned in the format field (e.g. 20R in attribute 5).
You can test this by creating a file and creating 3 records with the @IDs '360091600172130474', '360091600172131297' and '360091600172131299'.
Now, select upon the file where the @ID = '360091600172131297' and you can see that 2 records are returned!
When explicitly selecting a record via a unique key, this isn’t the result a database should return.
So, when dealing with large, potentially numeric fields with UniQuery, you may need two dictionary items: a left-aligned one for selecting on and a right-aligned one if you require numerical sorting.
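The effect of the two dictionary items can be modelled in a few lines of Python. This is only a simulation under the assumption that a right-aligned item compares numerically (as doubles) and a left-aligned item compares as strings; `select_ids` is a hypothetical helper, not a UniQuery feature.

```python
def select_ids(ids, target, justification):
    """Return the ids matching target under a modelled justification.

    "R" models a right-aligned dictionary item (numeric comparison);
    "L" models a left-aligned one (string comparison). Hypothetical.
    """
    if justification == "R":
        return [i for i in ids if float(i) == float(target)]
    return [i for i in ids if i == target]


ids = ["360091600172130474", "360091600172131297", "360091600172131299"]
target = "360091600172131297"

right = select_ids(ids, target, "R")  # model of the right-aligned item
left = select_ids(ids, target, "L")   # model of the left-aligned item
print(len(right), len(left))
```

In this model the right-aligned selection matches two keys while the left-aligned one matches exactly the key asked for.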
One of the benefits of U2 data servers is that they make it extremely quick to turn around a new system. The unfortunate downside is that this makes it extremely easy to ignore the architecture of your system. This can lead to future performance issues and harder-to-maintain programs.
Here I’ll be looking at the set up of your files and records (tables and columns for those still grasping UniData/UniVerse). Your system revolves around your data, so if you don’t get it right to start with it inevitably leads to a sub-optimal system. What I won’t be discussing here is the usual modulo/block-size related maintenance of your files; there is already literature in the manuals for this topic.
To start with, you should have already read my previous post about correctly setting up the layout of your files and the need to create all the relevant D-type dictionary items. With that in mind, I have a story for you…
This story is about Johnny and Alicia, who are both admin staff working for a sales company back in the 1930s. Both have a large set of contracts that they store in folders in a filing cabinet.
Occasionally their managers will ask them to find a contract that is being handled by a certain sales rep. Although they hate this task, each time they manually search through the stack of contracts to retrieve it. Funnily enough, in the time it takes Johnny to find one, Alicia can usually find at least two.
Curiosity gets the better of Johnny, who eventually asks Alicia how she is so fast.
“It’s easy, I have moved the page with the sales rep’s name to the front of the contract.”
Dang! So simple! Johnny realised having to dig ten pages deep on each contract was so senseless!
Fortunately, admin staff can now use digital retrieval systems, so they don’t have to think about this sort of small detail any more. The need to pay attention to this detail hasn’t gone away though. Now it rests with us.
Not only should you ensure the layout of data is in the correct format, but you should also pay attention to the order of your data. It should be organised with the most frequently searched and utilised data earlier in the record. Since the record fields are separated by delimiters, using and querying later attributes requires the engine to scan every character up to the requested attribute to determine where it starts. By moving the most frequently used data to the beginning of a record, you reduce the amount of work required to initially find the data.
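The cost can be pictured with a short Python sketch. It assumes a record is a single string with attributes separated by the attribute mark (character 254). The function `extract_attribute` is a hypothetical stand-in for the engine's lookup, written naively so that the scan is explicit; the real engine will differ in detail.

```python
AM = chr(254)  # attribute mark used as the field delimiter


def extract_attribute(record: str, n: int):
    """Return (value, chars_scanned) for attribute n (1-based)."""
    current = 1   # attribute the scan is currently inside
    start = 0     # index where the current attribute begins
    for pos, ch in enumerate(record):
        if ch == AM:
            if current == n:
                # Reached the delimiter that ends attribute n.
                return record[start:pos], pos + 1
            current += 1
            start = pos + 1
    if current == n:
        # Attribute n is the last one in the record.
        return record[start:], len(record)
    return "", len(record)


# A record with 29 attributes, each holding the same key.
record = AM.join(["12345"] * 29)
for n in (1, 2, 29):
    value, scanned = extract_attribute(record, n)
    print(n, value, scanned)
```

The number of characters examined grows with the attribute position, which is the extra work that placing hot data early avoids.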
Here are some timings from a simple test run I performed on our system.
The setup: A file with modulo 10007, pre-filled with records keyed from 10000 to 99999. Attributes 1, 2, … up until 29 are each set to the key. I have created a D-type attribute for each one timed (D1, D2 & D29).
The test: Perform a select on the file with the attribute equal to a value (e.g. SELECT TIMINGS WITH D1="12345"). Repeat this 1000 times for each attribute tested.
Data in <1>: 338655 (100.00%)
Data in <2>: 342134 (101.03%)
Data in <29>: 471811 (139.32%)
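Each percentage is the timing relative to the attribute 1 baseline. A quick Python check of that arithmetic, using the figures above (the name `timings` is just for illustration):

```python
# Timings as reported above, keyed by attribute position.
timings = {1: 338655, 2: 342134, 29: 471811}

baseline = timings[1]
relative = {pos: round(100 * t / baseline, 2) for pos, t in timings.items()}

for pos, pct in relative.items():
    print(f"Data in <{pos}>: {timings[pos]} ({pct:.2f}%)")
```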
Even with these small records, you can see the difference you can achieve by having your data in the correct order. Scale this up to larger files with bigger records, more complex select statements combined with the processing of these records in your subroutine and it can provide a significant difference in the execution times across a system.