Archive

Posts Tagged ‘UniData’

Data Integrity

February 27, 2011

One of the features not present in UniData that you may have become used to in the world of SQL is referential integrity.

Data is one of the most valuable assets of a company. For this reason alone, it should be treated with the utmost respect and professional care. Everybody knows that backing up data is essential, but what data are you backing up?

If the data is already corrupt, you’re in a whole world of hurt. How long has it been corrupt? Has it corrupted other data? Has it already impacted the business, and to what extent? You can’t just restore from several months ago. You have to spend the time manually working out what went wrong and how to fix it, and potentially trawling through backups to find data to reinstate.

Here I should clarify exactly what I’m referring to by ‘corrupt data’. I’m not talking about OS-level corruption; from here on I will be talking about 2 types of logical corruption: violations of referential integrity and violations of domain integrity (both defined further below).

Unlike the major databases (such as MSSQL, Oracle and MySQL), UniData and UniVerse do not support logical data integrity constraints in the database layer. This leaves it up to each individual application to ensure data integrity.

Anyone working with databases knows that bugs (both current and historical) can result in logical inconsistencies creeping into your data. The more denormalised your data, the higher the chance of this corruption.

Some of this corruption will become apparent immediately, because a process will fail and require you to locate and fix both the cause of the corruption and the corruption itself. Surprisingly, these are not the ones you should be most worried about. The worst are the ones you don’t notice, because they don’t cause the system to visibly malfunction. These can fester in your system for years, silently corrupting data derived from them and potentially influencing business decisions. Over time the data becomes much harder to repair, since the information needed to reconstruct it may no longer be at hand. If and when these eventually cause a problem, it will be far harder and more time-consuming to address, if it is possible at all.

Since we have to handle logical data integrity entirely in the application layer, U2 databases are somewhat more susceptible to these issues from code bugs. To combat this, there are 2 methods I propose you adopt.

The first is a Data Integrity Audit (DIA) you can schedule regularly in production. This validates your data and reports on any inconsistencies it encounters, helping you identify issues earlier and potentially track down the programs/conditions that are causing the corruption. We have already implemented this system for ourselves, and I’ll explain how we did it below.

The second method is based on the above DIA. By modifying it to run from file triggers, you can implement a system to use while testing (unit, system and user acceptance testing) that reports exactly what program/line is writing the corrupted record as it happens. Catch it BEFORE it reaches production! However, I don’t recommend actually implementing this in production (at least, not without great care and load testing) since it will have performance implications that may be unacceptable.


Implementing a Solution

Alright, enough of the prelude. Let’s talk about implementing a DIA program into your system. It isn’t as hard as you might think, and it can be set up incrementally so you can cover your most important data first.

The system has 4 parts to set up:

  1. Defining the Rules
  2. Storing the Rules
  3. Checking the Data
  4. Reporting on Violations

Defining the Rules

The first step is defining the logical rules that should be constraining your data. The rules will fall into 2 categories:

  • Referential integrity: Identify any attributes that are foreign keys (or lists of foreign keys)
  • Domain integrity: Specify the ‘domain’ of the field. This includes type (alpha, numeric, etc.), enumerations, length, and whether NULL is allowable.

Looking at a few of your key tables, you should be able to quickly identify some basic rules your data naturally should abide by. Write these down, as they will be easy rules to start testing with.
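For example, with a hypothetical ORDERS file (the file, attribute and value names here are purely illustrative), the first rules written down might be:

  • The customer ID in attribute 2 must be a key into the CUSTOMERS file and cannot be NULL (referential integrity)
  • The order status in attribute 5 must be one of an enumerated set of codes, say ‘O’, ‘P’ or ‘C’ (domain integrity)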

Storing the Rules

The second step is determining how to store the rules. Although you can do this however you want, there are several reasons that make using the dictionary file ideal:

  • It places the constraints in with the schema (both are structural metadata); collocation is a good thing.
  • Attribute 1 can store anything after the type; it allows you to store the constraint directly with the section of the schema you are constraining!
  • X-Type Attributes allow you to use enumerations (part of domain integrity) while still keeping them defined in the schema, instead of elsewhere.
  • It allows you to easily test and/or enforce the constraints with the ‘CALCULATE’ function (more on this later)

So, how exactly do you store the constraints in with the dictionary records? Here is the scheme we use:

TYPE [FKEY filename [PING f,v,s]] [MAND] [ENUM enum_item]

  • FKEY: Foreign key to ‘filename’
  • PING: Checks for @ID in the foreign record location <f,v,s>
  • MAND: Value cannot be NULL
  • ENUM: Value must be an enumeration in the dictionary X-type record ‘enum_item’

When attribute 6 of the dictionary item indicates that the data is a multivalued list, FKEY, MAND, ENUM and DATATYPE should respect it and treat each item in the list separately. The only special case is MAND, which only causes a violation when a multivalue in the list is empty; it does not cause a violation when there is no list at all. If you want to cover that case, you can create another, non-multivalued dictionary item as well and apply the MAND rule to it.
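As a hypothetical illustration of the scheme (the file and item names are mine, not a standard), a D-type dictionary item for a customer ID attribute in an ORDERS file might carry its constraints after the type in attribute 1 like so:

 CUST_ID in DICT ORDERS:
 001: D FKEY CUSTOMERS MAND
 002: 2
 003:
 004:
 005: 10L
 006: S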

Checking the Data

The third part is how you will test/enforce these constraints:

  • Production: A program that, given a filename, reads in the dictionary items and their associated constraints. It can then test each record and report any violations. This would typically be run as part of a nightly job, and/or if you are set up for it, on a backup/restore of production onto a development machine.
  • Development: An update trigger subroutine that is only implemented on development. This also allows you to transparently test whether new or modified code is corrupting your data before it even makes it into production. Although this would typically not be implemented in your actual production system due to the performance impact, there is no technical reason it cannot be done if so desired (even just for selected files).

These methods are not mutually exclusive and are designed to cover different situations. The first is a post-corruption check that allows you to identify issues faster than you normally would; the second provides better test coverage and reduces the risk of introducing faulty code into your production system.
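To make the production-style check more concrete, below is a minimal UniBasic sketch of such a program. It assumes the constraint scheme described above (the rule keywords stored after the type in attribute 1 of each D-type dictionary item) and the DIA_RESULTS file described in the next section; all names are illustrative, only the MAND rule is shown, and the FKEY, ENUM and type checks would follow the same pattern.

 SUBROUTINE DIA.CHECK(FILENAME)
 * Minimal sketch of a Data Integrity Audit pass over one file.
 * Only the MAND rule is tested; FKEY/ENUM/type checks follow the same shape.
    OPEN FILENAME TO F.DATA ELSE RETURN
    OPEN 'DICT', FILENAME TO F.DICT ELSE RETURN
    OPEN 'DIA_RESULTS' TO F.RESULTS ELSE RETURN
 *
 * Collect the D-type dictionary items that carry constraints in attribute 1
    CONSTRAINED.DICTS = ''
    SELECT F.DICT
    LOOP
       READNEXT DICT.ID ELSE EXIT
       READ DICT.REC FROM F.DICT, DICT.ID ELSE CONTINUE
       IF DICT.REC[1,1] NE 'D' THEN CONTINUE
       IF INDEX(DICT.REC<1>, 'MAND', 1) THEN CONSTRAINED.DICTS<-1> = DICT.ID
    REPEAT
 *
 * Test every record against every constrained dictionary item
    NO.DICTS = DCOUNT(CONSTRAINED.DICTS, @AM)
    SELECT F.DATA
    LOOP
       READNEXT ID ELSE EXIT
       READ REC FROM F.DATA, ID ELSE CONTINUE
       FOR I = 1 TO NO.DICTS
          DICT.ID = CONSTRAINED.DICTS<I>
          READ DICT.REC FROM F.DICT, DICT.ID ELSE CONTINUE
          ATTR.NO = DICT.REC<2>           ;* attribute 2 holds the data location
          VALUE = REC<ATTR.NO>
          IF VALUE EQ '' THEN GOSUB LOG.VIOLATION
       NEXT I
    REPEAT
    RETURN
 *
 LOG.VIOLATION:
 * Record the violation using the layout described under 'Reporting the Violations'
    VIOL = DATE()
    VIOL<2> = TIME()
    VIOL<3> = FILENAME
    VIOL<4> = ID
    VIOL<5> = DICT.ID
    VIOL<6> = 'MAND'
    VIOL<7> = VALUE
    WRITE VIOL ON F.RESULTS, FILENAME : '*' : ID : '*' : DICT.ID
    RETURN
 END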

Reporting the Violations

The fourth and final part of the system is how you report it.

There are many options you may want to consider, depending on your needs and which of the 2 methods above you are using.

We decided upon a non-obtrusive option that allows us to build either reports or select lists from the results. This method requires you to create a new file to store the results. For the sake of this article, let us call it DIA_RESULTS. You can clear this file just before running the DIA program, or before performing tests if you are using the trigger method.

In DIA_RESULTS, each record should contain the following information:

  • Date failed
  • Time failed
  • Filename the violation was on
  • Key the violation was on
  • Dictionary item used when the violation occurred
  • Rule name the violation occurred on
  • The value that caused the violation (just in case it changes before you get to it)
  • If from a trigger, the current call stack

Using this information, it is easy to print reports, create select lists to get to the records, and determine exactly what was wrong with the data.
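For illustration, a single DIA_RESULTS record following that layout might look something like this (the key scheme and all values are entirely hypothetical):

 ID:  ORDERS*12345*CUST_ID
 001: 15821        <- date failed (internal format)
 002: 34200        <- time failed
 003: ORDERS       <- file the violation was on
 004: 12345        <- key the violation was on
 005: CUST_ID      <- dictionary item used
 006: FKEY         <- rule violated
 007: 99999        <- the offending value
 008:              <- call stack (only populated when run from a trigger)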


Cheat Sheet: UniBasic Debugger

October 30, 2010

I’ve just completed the first release version of a small cheat sheet for using the UniBasic Debugger in UniData. Please give feedback if you find any typos or misinformation, or just let me know if you find it helpful. :)

Preview - Debugger Cheat Sheet

Click below to go to the page with the PDF available for download:

UniBasic Debugger Cheat Sheet v1.0.0

I’m currently making a few others, so if you like the idea, stay tuned over the next couple of months.

The problem with numbers

October 11, 2010 4 comments

UniData, like other weakly-typed systems, makes some programming tasks easier by not requiring the developer to declare and adhere to a data type for each variable. The general pros and cons of this have been debated many times across many languages and hence will not be discussed here. What will be discussed are specific cases where this can cause you unexpected headaches.

A posting was made on the u2ug forums by Koglan Naicker about some unexpected issues when finding duplicates in a data-set.

In short, he found that when strings contained large numbers, it would sometimes incorrectly evaluate two different strings as equal. For example:

IF '360091600172131297' EQ '360091600172131299' THEN CRT "Equal"

The above code results in “Equal” being displayed on the screen. This is caused by a combination of 2 factors.

The first is that UniData is weakly typed. This means that it does not explicitly distinguish between strings and numbers, but attempts to determine the data type by examining the data. In this case, since the strings are numeric, it automatically treats them as numbers.

The second part of the issue is that, now that those 2 strings are treated as numbers, they need to be handled in an appropriate data type on the CPU. Since the 2 strings are too large to be treated as integers, they get converted to floating-point numbers. Due to the rounding that occurs, both strings end up with the same floating-point representation! A better method may have been to use something such as a bignum representation instead of converting to floating-point. There would be a speed trade-off, but surely that would have been better than potentially incorrect programs.

Some people suggest prefixing or appending a non-numeric character to each string to force them to be treated as strings. This is not entirely elegant and can have performance implications. Fortunately, UniData does have a proper function to handle these situations. In cases where you will be comparing strings that may consist of only numeric characters, you should use the SCMP function. This function compares two strings as strings, regardless of the actual data in them. Using it when you need to be certain how the comparison is performed can save you a lot of headaches in the future.
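For example, a quick comparison along these lines (using the values from above) shows the difference; the EQ line prints because both sides collapse to the same floating-point value, while SCMP correctly reports them as different:

 A = '360091600172131297'
 B = '360091600172131299'
 IF A EQ B THEN CRT 'EQ thinks they are equal'
 IF SCMP(A, B) EQ 0 THEN CRT 'SCMP thinks they are equal' ELSE CRT 'SCMP thinks they are different'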

Also of interest is that this issue doesn’t just apply to UniBasic, but can also affect UniQuery!

It should be noted, though, that this only affects UniQuery when the dictionary item is right-aligned in the format field (e.g. 20R in attribute 5).

You can test this by creating a file and creating 3 records with the @IDs ‘360091600172130474’, ‘360091600172131297’ and ‘360091600172131299’.

Now, select upon the file where the @ID = ‘360091600172131297’ and you can see that 2 records are returned!

Results of selection: non-unique select

When explicitly selecting a record via a unique key, this isn’t the result a database should return.

So, when dealing with large, potentially numeric fields in UniQuery, you may need 2 dictionary items: a left-aligned one for selecting on and a right-aligned one if you require numerical sorting.

Optimising the set up of your UniData data [Part 1]

September 3, 2010 1 comment

One of the benefits of the U2 data servers is that it can be extremely quick to turn around a new system. The unfortunate downside is that this makes it extremely easy to ignore the architecture of your system, which can lead to future performance issues and harder-to-maintain programs.

Here I’ll be looking at the set-up of your files and records (tables and columns, for those still getting to grips with UniData/UniVerse). Your system revolves around your data, so if you don’t get it right to start with, it inevitably leads to a sub-optimal system. What I won’t be discussing here is the usual modulo/block-size related maintenance of your files; there is already literature in the manuals for that topic.

To start with, you should have already read my previous post about correctly setting up the layout of your files and the need to create all the relevant D-type dictionary items. With that in mind, I have a story for you…

This story is about Johnny and Alicia, who are both admin staff working for a sales company back in the 1930s. Both have a large set of contracts that they store in folders in a filing cabinet.

Occasionally their managers will ask them to find a contract that is being handled by a certain sales rep. Although they hate this task, each time they manually search through the stack of contracts to retrieve it. Funnily enough, in the time it takes Johnny to find one, Alicia can usually find at least two.

Curiosity gets the better of Johnny, who eventually asks Alicia how she is so fast.

“It’s easy; I have moved the page with the sales rep’s name to the front of the contract.”

Dang! So simple! Johnny realised that having to dig ten pages deep into each contract was senseless!

Fortunately, admin staff can now use digital retrieval systems, so they don’t have to think about this sort of small detail any more. The need to pay attention to this detail hasn’t gone away though. Now it rests with us.

Not only should you ensure the layout of data is in the correct format, but you should also pay attention to the order of your data. It should be organised with the most frequently searched-upon and utilised data earlier in the record. Since the record’s fields are separated by delimiters, using and querying later attributes requires the engine to scan every character up until the requested attribute to determine where it starts. By moving the most frequently used data to the beginning of a record, you reduce the amount of work required to initially find the data.

Here are some timings from a simple test run I performed on our system.

The setup: A file with modulo 10007, pre-filled with records keyed from 10000 to 99999. Attributes 1 through 29 are each set to the key. I have created a D-type dictionary item for each attribute timed (D1, D2 & D29).

The test: Perform a select on the file with the attribute equal to a value (E.g. SELECT TIMINGS WITH D1=”12345″). Repeat this 1000 times for each attribute tested.
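If you want to reproduce something similar, a rough sketch of the set-up and test loop is below. It assumes the TIMINGS file and the D1/D2/D29 dictionary items have already been created at ECL (e.g. with CREATE.FILE), and it leaves the actual timing harness out.

 * Pre-fill the test file: attributes 1-29 of every record hold the key
    OPEN 'TIMINGS' TO F.TIMINGS ELSE STOP   ;* create the file at ECL first
    FOR ID = 10000 TO 99999
       REC = ''
       FOR ATTR = 1 TO 29
          REC<ATTR> = ID
       NEXT ATTR
       WRITE REC ON F.TIMINGS, ID
    NEXT ID
 * Each timed run then repeats the select 1000 times, e.g. for D1:
    FOR I = 1 TO 1000
       EXECUTE 'SELECT TIMINGS WITH D1 = "12345"'
       CLEARSELECT
    NEXT I
 END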

Results:

Timings for 1000 SELECTs:

Data in <1>: 338655 (100.00%)
Data in <2>: 342134 (101.03%)

Data in <29>: 471811 (139.32%)

Even with these small records, you can see the difference you can achieve by having your data in the correct order. Scale this up to larger files with bigger records, more complex select statements and the processing of these records in your subroutines, and it can make a significant difference to execution times across a system.

Open Source: Some Positive News

August 12, 2010

In case you haven’t noticed, open source has exploded into the mainstream, and a profitable band-wagon has built up around “setup & support”, “customisation” and “enhanced enterprise editions”.

Yes, a lot of those companies do not solely handle FOSS projects, but it is a valuable part of their business.

The best part of FOSS is that because it is free and readily available, the number of people who will be exposed to the product is greatly increased. With MV-style databases largely unknown (and not understood), having more people aware of the technology can only improve the scene for those of us who work with it. More companies using it means more jobs. Who can argue against that?

That’s why Brian Leach’s announcement at the end of July is such a positive step for the community at large.

mvScan was originally a tool that I had developed for my use, to document a UniVerse system by iterating through the account and file structures, building impact maps and filling out tables with information culled from the entries found to make it easier for someone to search through their system.

So I’ve decided the best way forward is for me to open it up. That way, people who want to run it on their systems can do so and feed back any updates and changes that result from applying it to their specific structure and code organization.

So watch this space for announcements. If this goes well, there’s plenty of other stuff I want to open source.

You can read more on mvScan at Brian’s site.

I don’t know about you, but I’m looking forward to the release.

Crouching Null, Hidden Bug

May 2, 2010 1 comment

Null (actually just an empty string “” in U2) is a valid value. Normally it is treated exactly the same as other values, such as 1 or “1”, but not always.

I’ve seen a few bugs that have been created by not understanding the differences in how nulls are treated. When debugging, the code can look completely valid as well, meaning it takes even longer to identify and rectify the issue.

Okay, let’s see how you go. I’ll give you a few series of records, each created with the same data. Your job is to work out which records in each series will exactly match their ‘CONTROL.REC’. Good luck, and try to do it without needing to compile the code!

Series 1:

 CONTROL.REC = "" : @AM : "A"

 DIM REC(3)

 REC(0) = ""
 REC(0)<2> = "A"

 REC(1) = ""
 REC(1)<-1> = "A"

 REC(2) = ""
 INS "A" BEFORE REC(2)<2>

 REC(3) = "A"
 INS "" BEFORE REC(3)<1>


Series 2:

 CONTROL.REC = "A" : @AM : ""

 DIM REC(3)

 REC(0) = "A"
 REC(0)<2> = ""

 REC(1) = "A"
 REC(1)<-1> = ""

 REC(2) = "A"
 INS "" BEFORE REC(2)<2>

 REC(3) = ""
 INS "A" BEFORE REC(3)<1>


Series 3:

 CONTROL.REC = "" : @AM : "A" : @AM : ""

 DIM REC(3)

 REC(0) = ""
 REC(0)<2> = "A"
 REC(0)<3> = ""

 REC(1) = ""
 REC(1)<-1> = "A"
 REC(1)<-1> = ""

 REC(2) = ""
 INS "A" BEFORE REC(2)<1>
 INS "" BEFORE REC(2)<1>

 REC(3) = ""
 INS "A" BEFORE REC(3)<2>
 INS "" BEFORE REC(3)<3>


Series 4:

 CONTROL.REC = "A" : @AM : "" : @AM : "A"

 DIM REC(3)

 REC(0) = "A"
 REC(0)<2> = ""
 REC(0)<3> = "A"

 REC(1) = "A"
 REC(1)<-1> = ""
 REC(1)<-1> = "A"

 REC(2) = "A"
 INS "" BEFORE REC(2)<1>
 INS "A" BEFORE REC(2)<1>

 REC(3) = "A"
 INS "" BEFORE REC(3)<2>
 INS "A" BEFORE REC(3)<3>


Series 5:

 CONTROL.REC = "A" : @AM : "" : @AM : ""

 DIM REC(4)

 REC(0) = "A"
 REC(0)<2> = ""
 REC(0)<3> = ""

 REC(1) = "A"
 REC(1)<-1> = ""
 REC(1)<-1> = ""

 REC(2) = "A"
 INS "" BEFORE REC(2)<2>
 INS "" BEFORE REC(2)<2>

 REC(3) = "A"
 INS "" BEFORE REC(3)<2>
 INS "" BEFORE REC(3)<3>
 
 REC(4) = ""
 INS "" BEFORE REC(4)<1>
 INS "A" BEFORE REC(4)<1>


Series 6:

 CONTROL.REC = "A" : @AM : "B" : @AM : "C"

 DIM REC(4)

 REC(0) = "A"
 REC(0)<2> = "B"
 REC(0)<3> = "C"

 REC(1) = "A"
 REC(1)<-1> = "B"
 REC(1)<-1> = "C"

 REC(2) = "A"
 INS "C" BEFORE REC(2)<2>
 INS "B" BEFORE REC(2)<2>

 REC(3) = "A"
 INS "B" BEFORE REC(3)<2>
 INS "C" BEFORE REC(3)<3>
 
 REC(4) = "C"
 INS "B" BEFORE REC(4)<1>
 INS "A" BEFORE REC(4)<1>


Did you get them all? I’ve put the answers as a comment so you can check them if you want. I’d be surprised if you got them all right…

So, what should you take away from this? <-1> and INS can give you (or others!) a world of headaches if you don’t understand their peculiarities with null values.

Final Note: In UniVerse, INS behaviour for some cases is dependent on the flavour your account is running in and the $OPTIONS EXTRA.DELIM setting.

Improving U2 Security

April 11, 2010 3 comments

The general IT knowledge of security has come a long way in the last 20 years, and even more dramatically in the last 10.

People are generally aware that unless due care is taken, their computer could be infected with a virus, have personal information stolen from it or even be used to facilitate crime. Major OS vendors have picked up their game and are now making a better attempt to prevent compromises at the OS level. Sure, you still hear the odd story about the latest privilege escalation, but compared to what it used to be…

Network-level security has been given most of the attention (and IT budget funding) and is *generally* fairly secure these days. The application level is where most of the major hacks are happening now, but unfortunately, corporate uptake on securing systems at the application level hasn’t been as good as it was with networks.

Let’s be honest and not undersell ourselves: securing complex applications is no mean feat. It takes knowledge, planning, lots of time & patience and sometimes out-of-the-box thinking. Thankfully, most modern programming languages and database management systems do the heavy lifting for us. From the security features built into C# and Java to the vastly improved safety net found in SQL engines, with fine-grained access control and built-in functions for preventing SQL injection, a lot of the basics have been solved.

This is where the U2 family has a few gaps to be filled. UniBasic needs some inbuilt functions for sanitisation, UniObjects needs some form of access control built around it and UniQuery/RetrieVe prepared statements/stored procedures would be nice.

With the increased push to integrate U2 servers as databases for modern front-ends such as web applications, data sanitisation is going to become a prevalent topic in the community. Built-in functions for UniQuery/RetrieVe, SQL and HTML sanitisation/encoding would be welcome additions to the UniBasic command family. Even better would be some form of prepared statements for the query languages. This would make it simpler and easier to obtain better program security.
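Until something like that is built in, about the best you can do in application code is a crude, hand-rolled workaround. The sketch below (the file, field and variable names are purely illustrative, and this is no substitute for a real engine-level facility) strips the characters that would let untrusted input break out of a quoted UniQuery value before the statement is executed:

 * Hypothetical helper: remove quotes and system delimiters from untrusted
 * input before splicing it into a query string.
    USER.INPUT = 'Smith" OR WITH SURNAME # "'   ;* an example of hostile input
    SAFE.VALUE = USER.INPUT
    CONVERT '"' : "'" : @AM : @VM : @SM TO '' IN SAFE.VALUE
    EXECUTE 'SELECT CUSTOMERS WITH SURNAME = "' : SAFE.VALUE : '"'
 END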

UniObjects is touted as a standard method of connecting GUI application front-ends to a U2 back-end. However, due to the limited access control supported by UniObjects, having the required port open to anything other than back-end servers is a dangerous hole in your system. Consider user ‘X’. User ‘X’ has appropriate login credentials for the old green-screen system. IT brings out a new Windows GUI application, let’s say for reporting, that runs on the user’s machine and uses UniObjects to connect to U2. In the old green-screen system, user ‘X’ was limited to set menus and programs and could not get access to ECL/TCL. With enough knowledge (and malice), user ‘X’ can now freely use his green-screen login credentials to log into the U2 system via UniObjects, read/write records directly and even execute raw ECL/TCL commands.

So what exactly is the problem with UniObjects? Quite simply, it has no fine-grained server-side control over what actions can be performed or what commands can be issued via UniObjects. As long as you can log in, you get a free pass to the back-end’s data. Let’s take MSSQL as a counter-example: you can create views and stored procedures, and grant or deny users a suite of privileges on tables and commands. Essentially, UniData needs some access control scheme for UniQuery that allows you to define whether users can read/write records in certain files. Ideally, all reads/writes would be done through UniBasic subroutines, with the RPC daemon supporting a command ‘white-list’. That way, all data access can be moderated by UniBasic code, and the RPC daemon’s white-list only allows those subroutines to be called. A rough sketch of the subroutine side of that idea follows.
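As a purely illustrative sketch of the application-side half of that idea (the catalogued subroutine a UniObjects client would call; the white-list here is hard-coded and all names are made up), a read-only gatekeeper might look like:

 SUBROUTINE UO.READ.RECORD(FILENAME, ID, REC, ERR)
 * Gatekeeper exposed to UniObjects clients: every read comes through here,
 * and only files on the white-list may be touched.
    ERR = ''
    REC = ''
    ALLOWED.FILES = 'CUSTOMERS' : @VM : 'ORDERS'   ;* illustrative white-list
    LOCATE FILENAME IN ALLOWED.FILES<1> SETTING POS ELSE
       ERR = 'Access denied to ' : FILENAME
       RETURN
    END
    OPEN FILENAME TO F.VAR ELSE
       ERR = 'Cannot open ' : FILENAME
       RETURN
    END
    READ REC FROM F.VAR, ID ELSE ERR = 'Record not found'
    RETURN
 END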

All this highlights an issue we need to overcome as a community: the lack of U2-specific security literature. Where is the UniData/UniVerse security manual? Where is the “Top 10 common security mistakes” for U2? Sadly, security does seem to be an afterthought. Sometimes even a ‘neverthought’.
