Everyone here probably knows the various levels of caching that exist on a modern computer: from multiple CPU caches through to disk cache and even caching in the database engine itself. If you want to quickly brush up on some caching concepts and terminology, check out this short slide deck from Serhiy Oplakanets on Caching Basics.
What I’m going to do shortly is outline some other methods of gaining significant performance improvements on your UniData and UniVerse systems.
There really isn’t anything special outside of U2 that you will need to get benefits from this, although a few extra tricks that require either additional hardware or OS work can give quite a boost.
First, just to make sure everyone is on the same page: since UniData and UniVerse support hash-tables as their file (table) structure, you can simply use a file as a glorified key-value store. Key-value stores are ideal for caching.
I’ve divided this post into 4 sections:
Let me know what you think.
Session Level Caching
COMMON provides a method of keeping a small in-memory cache for the entire duration of a session. Simply declare an array in a named common block and away you go.
A real-world example: I’ve seen this used where a dictionary item made a SUBR call to a subroutine that in turn read a multitude of control items to process the original record. This dictionary item was called nightly by an external reporting tool on a large number of records.
The original solution had an unacceptable run-time and after some profiling, it was determined that the READs of the control items were the main culprit. Since it was known that the control items would not change (and should not) during the processing, it was determined that caching the control items in-memory after they were read would reduce the run-time.
The solution involved: An array of ‘x’ elements. When a control item needed to be read in, it checked this array via a simple look-up and if it existed, it used it. If not, it would read it from disk and store it in the array.
The result: 10+ hour run-time was now less than 1 hour.
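To make the look-up concrete, here is a minimal sketch of the same idea in Python rather than UniBasic. The module-level dictionary stands in for the array in named COMMON, and the control item names and contents are made up for illustration; the real program would do a READ where this does a dictionary look-up.

```python
# Stand-in for an array in a named COMMON block: it lives for the whole session.
_control_cache = {}
disk_reads = 0

# Hypothetical control items, used here in place of a real control file.
_CONTROL_FILE = {"TAX.RATES": "0.10", "REGIONS": "N\xfeS\xfeE\xfeW"}

def read_from_disk(item_id):
    """Simulated READ of a control item; counts how often 'disk' is hit."""
    global disk_reads
    disk_reads += 1
    return _CONTROL_FILE.get(item_id)

def get_control_item(item_id):
    """Return a control item, going to disk only the first time it is asked for."""
    if item_id not in _control_cache:
        _control_cache[item_id] = read_from_disk(item_id)
    return _control_cache[item_id]

# Simulate the subroutine being called once per record on a large file.
for _ in range(1000):
    get_control_item("TAX.RATES")
    get_control_item("REGIONS")

print(disk_reads)  # 2: one read per control item, however many records are processed
```

This only works because the control items are known not to change during the run. If they could change, the cache would need some way of being refreshed.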
Alright, so you have a system that needs to handle some messages (perhaps via some form of SOAP/REST web service). The majority are read requests with a few write requests for good measure.
One of these messages is to ‘Get Products’. This message returns a list of products (ID, name and latest available version) that a customer currently has.
In your system, there are 2 files used by this request. ‘CUSTOMERS’ and ‘PRODUCTS’. CUSTOMERS<30> is a multivalued list of record ids for ‘PRODUCTS’. PRODUCTS<1> is the name of the product and PRODUCTS<11> is the latest available version.
Traditionally, for each ‘Get Products’ request your system would read in the appropriate CUSTOMERS record, then read in all the linked records from PRODUCTS to compile the response to the query. Assuming an average customer has 10 products, the average number of disk reads for this query is 11.
Now that this query is being called a lot, all these extra disk reads and processing are beginning to hurt performance. Thankfully, because your database supports key-value storage, you can quickly implement a cache to sit in between the receipt of the message and the processing.
All that is needed is a new file called ‘CACHE.GETPRODUCTS’. @ID is the CUSTOMERS id requested in the query, <1> is the date requested, <2> is the time requested and <3> is the response generated.
Now, when a ‘Get Products’ query is received, it will first do a read of the cache file and if the entry exists, simply return <3>. If the entry doesn’t exist, it will hand the request off to the usual processing routine. The resulting response will then be stored in the cache before being returned.
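The flow above can be sketched roughly as follows. This is Python standing in for UniBasic, with plain dictionaries standing in for the CUSTOMERS, PRODUCTS and CACHE.GETPRODUCTS files. The sample records and the function names are invented for illustration, and it assumes the customer record exists.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the three files. Attribute numbers follow the post.
CUSTOMERS = {"C1": {30: ["P1", "P2"]}}
PRODUCTS = {"P1": {1: "Widget", 11: "2.0"}, "P2": {1: "Gadget", 11: "1.4"}}
CACHE_GETPRODUCTS = {}
reads = {"count": 0}

def read(file, key):
    """Simulated READ that counts every access."""
    reads["count"] += 1
    return file.get(key)

def build_response(customer_id):
    """The usual processing: 1 CUSTOMERS read plus 1 read per linked product."""
    customer = read(CUSTOMERS, customer_id)
    response = []
    for product_id in customer[30]:
        product = read(PRODUCTS, product_id)
        response.append((product_id, product[1], product[11]))
    return response

def get_products(customer_id):
    """Check the cache first; fall back to the usual processing on a miss."""
    cached = read(CACHE_GETPRODUCTS, customer_id)
    if cached is not None:
        return cached[3]
    response = build_response(customer_id)
    now = datetime.now()
    # <1> date requested, <2> time requested, <3> response generated
    CACHE_GETPRODUCTS[customer_id] = {
        1: now.date().isoformat(),
        2: now.time().isoformat(),
        3: response,
    }
    return response

first = get_products("C1")
reads_after_miss = reads["count"]
second = get_products("C1")
reads_after_hit = reads["count"] - reads_after_miss
```

With the two-product customer above, the miss costs 4 reads (cache, customer, 2 products) and the hit costs 1.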
Assuming the average declared above, a cache hit will result in 1 disk read and a cache miss will result in 12 disk reads (1 cache read plus the usual 11) and 1 write. If, for ease of math, we treat a write as equal to a read, you only need a 16.7% cache hit rate for it to perform better. That isn’t even taking into consideration CPU usage reduction, better disk cache performance, etc.
How you handle cache invalidation is dependent on your situation. It could be as simple as clearing it every ‘x’ period, as straightforward as ignoring the cache record if it is older than ‘y’ time, or as complex as individually invalidating records based on when the appropriate records in CUSTOMERS or PRODUCTS change.
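If you want to check that figure or plug in your own averages, the arithmetic is easy to script. This is just the sums from the paragraph above written out in Python, with a write counted as one read; the function names are mine.

```python
from fractions import Fraction

def expected_cost(hit_rate, products=10):
    """Average disk operations per request with the cache in place."""
    uncached = 1 + products      # 1 CUSTOMERS read + 1 read per product
    hit = 1                      # the cache read only
    miss = 1 + uncached + 1      # cache read + usual reads + cache write
    return hit_rate * hit + (1 - hit_rate) * miss

def break_even(products=10):
    """Hit rate at which the cached path costs the same as the uncached path."""
    uncached = 1 + products
    miss = uncached + 2
    # Solve h * 1 + (1 - h) * miss = uncached for h.
    return Fraction(miss - uncached, miss - 1)

print(break_even())          # 1/6
print(float(break_even()))   # roughly 0.167, i.e. the 16.7% quoted above
```

Anything above that hit rate and the cache is a net win on disk operations alone.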
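As a sketch of the middle option (ignoring a cache record once it is older than ‘y’), something like the following would do. It is Python for illustration only; the `stored_at` field and the 300 second limit are assumptions of mine, not part of the record layout described earlier.

```python
import time

MAX_AGE_SECONDS = 300  # hypothetical value for 'y'; pick whatever suits your data

def is_fresh(entry, now=None, max_age=MAX_AGE_SECONDS):
    """True if a cache entry exists and is no older than max_age seconds."""
    if entry is None:
        return False
    now = time.time() if now is None else now
    return (now - entry["stored_at"]) <= max_age

# A stale or missing entry is simply treated as a cache miss by the caller.
entry = {"stored_at": 1000.0, "response": "..."}
print(is_fresh(entry, now=1200.0))  # True: 200 seconds old
print(is_fresh(entry, now=1400.0))  # False: 400 seconds old
```

The nice thing about this approach is that nothing has to actively clear the file; old entries are just overwritten the next time they are requested.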
What has been implemented here is a cache that is available not only in the current session, but to any program running or that will be run in the account(s) that have access to this cache file.
Improving the above
Okay, so you have a more intensive system than the above and you have determined caching can help you out. The problem is, even with the caching it still doesn’t meet your requirements and disk has been determined to be the biggest bottleneck.
You have 2 next steps that can be implemented easily.
The Disk Approach
Simply drop in a shiny new SSD drive or a WD Raptor and move the cache files over there. No need to back them up, mirror them or anything else, as cache files are temporary data. As long as your system is set up to recreate them if missing on start-up and to treat it as a cache miss if they are unavailable during operation, you are all set.
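The ‘treat it as a cache miss if unavailable’ part is the bit that is easy to forget, so here is a rough sketch of what I mean. It uses Python and ordinary OS files purely for illustration (one file per key, which assumes the keys are safe to use as file names); in U2 the equivalent would be handling a failed OPEN or READ on the cache file.

```python
import os
import tempfile

def cache_read(cache_dir, key):
    """Return the cached value, or None if the entry or the whole cache is unavailable."""
    try:
        with open(os.path.join(cache_dir, key), "r", encoding="utf-8") as fh:
            return fh.read()
    except OSError:
        return None  # missing drive, missing directory, missing entry: all just a miss

def cache_write(cache_dir, key, value):
    """Store a value, recreating the cache location if needed. Failure is not fatal."""
    try:
        os.makedirs(cache_dir, exist_ok=True)
        with open(os.path.join(cache_dir, key), "w", encoding="utf-8") as fh:
            fh.write(value)
        return True
    except OSError:
        return False

with tempfile.TemporaryDirectory() as base:
    cache_dir = os.path.join(base, "CACHE.GETPRODUCTS")  # does not exist yet
    first = cache_read(cache_dir, "C1")                  # cache missing: a miss, not an error
    stored = cache_write(cache_dir, "C1", "response text")
    second = cache_read(cache_dir, "C1")
```

The important design choice is that the cache never causes a request to fail. The worst case is that you fall back to the normal processing.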
The benefit here is faster disk access as well as moving the activity off on to another device/bus.
The RAM approach
Instead of adding new hardware, perhaps you’d prefer to spare 64MB of RAM to the cause. In this case, you would simply create a RAM Drive and move the cache files there. You have now essentially created a RAM based key-value store to use as your heart desires.
For an example of what type of improvements this can have, I took the DOSAC test I previously created and ran it twice. Once with the file on traditional disk and once with the file on RAM Disk. The system stats are identical to last time I ran the test, except it was on Fedora (it comes with multiple 16MB RAM disks pre-configured).
That’s right: Massive improvements, as expected (excuse the display text bug).
So, keep this in mind. U2 Databases give you some great flexibility in how you implement your software. Knowing the options available is crucial to being able to get the best results.
As the saying goes, measure twice, cut once. Work out what your performance bottlenecks are, then determine the best solution. Sometimes it is better hardware, sometimes it is code clean up. Sometimes… it might just call for caching.
A few years ago I read an interesting article titled Denial of Service via Algorithmic Complexity Attacks. When I started working with UniData, it never crossed my mind that U2 had the same class of vulnerabilities, but it does.
If you develop for a U2 system where you cannot afford for malicious internal/external entities to adversely affect system performance, then I highly suggest you read the above linked paper.
I’ll divide this into 3 sections.
The first place I’ll draw your attention to is the humble hash file at the core of UniData and UniVerse. As you probably know, each record is placed in a group dependent on the hash value of its record ID, along with the modulo and hashing algorithm of the file. Now, there are 2 hashing algorithms that a hashed file can use. Type 0 or ‘GENERAL’ is the default, general use hashing algorithm, whereas Type 1 or ‘SEQ.NUM’ is an alternative you can specify and is designed to handle sequential keys. The hash file is basically a hash table with chaining.
Let’s assume we’re working at the HackMe Ltd company that has made a public website to integrate with their existing backend system, which is UniData driven. It is decided that people can pick their own usernames when signing up. Since these usernames are unique, they have been used as the record ID.
Ever since he was not hired after interviewing at HackMe Ltd, Harry has wanted to show them up. Knowing from his interview (and their job ads) that they use UniData on the backend, he installs UniData, makes some initial guesses at the modulo for their ‘users’ table and calculates a few sets of usernames for different modulos.
Now, by going to their website and taking timings for the “Check username availability” feature, Harry is able to become reasonably sure of the modulo for the file. He sets up his computer to run all night generating keys that hash to a single group. He also sets up his email server to automatically do a wget on the confirmation URL in received emails (hence getting around the “Confirm email address” emails).
The next day he runs a script to sign up all the usernames gradually over the day. After they have all been signed up, Harry now simply scripts a few “Check username availability” calls for his last username generated to start his Denial of Service attack. Essentially, he has taken the non-matching lookup performance of the hash file from O(1 + k/n) to O(k) (where k is the number of keys and n is the modulo). Even worse than that, because of how level 1 overflows work, it now requires multiple disk reads as well (UniData only, I believe). Continual random access to a file that is heavily weighted in one group is O(k^2).
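To show the effect without giving away anything about the real hashing algorithms, here is a small Python sketch of a hash table with chaining. The hash function is a toy of my own and is NOT the UniData or UniVerse algorithm, and the modulo is kept small so the brute-force key search finishes quickly.

```python
def toy_hash(key):
    """Toy string hash for illustration only; not the U2 algorithm."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def build(keys, modulo):
    """Place every key in its group, like records in a hashed file."""
    groups = [[] for _ in range(modulo)]
    for key in keys:
        groups[toy_hash(key) % modulo].append(key)
    return groups

def failed_lookup_cost(groups, key, modulo):
    """Key comparisons needed to find out that `key` is not in the file."""
    return len(groups[toy_hash(key) % modulo])

def colliding_keys(count, modulo, target_group=0):
    """Brute-force keys that all land in one group, as Harry does overnight."""
    found, n = [], 0
    while len(found) < count:
        key = f"user{n}"
        if toy_hash(key) % modulo == target_group:
            found.append(key)
        n += 1
    return found

MODULO = 101
COUNT = 1000
ordinary = [f"user{n}" for n in range(COUNT)]
chosen = colliding_keys(COUNT + 1, MODULO)
probe = chosen.pop()  # an unused username that hashes to the attacked group

ordinary_cost = failed_lookup_cost(build(ordinary, MODULO), probe, MODULO)
attack_cost = failed_lookup_cost(build(chosen, MODULO), probe, MODULO)
print(ordinary_cost, attack_cost)
```

With ordinary keys the probe only has to look through the handful of keys that happen to share its group. With the chosen keys it has to walk past every single one of them.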
Now, to give you a visual example, I have run a test on my home machine and produced 2 graphs.
CPU: Core Duo T7250 (2.0GHZ)
OS: Vista SP2 (32-bit)
DB: UniData 7.2 PE (Built 3771)
Hash File: Modulo 4013 – Type 0
Pre-generate 2 sets of numbers. One is of sequential keys, the other is of keys chosen because they all hash to a single group. Timings are recorded for the total time in milliseconds for:
- Write null records for all the keys and
- read in all the records.
Separate timings for sequential and chosen keys are taken. The test is repeated for different key counts from 1000 to 59000 in 1000 increments.
First Graph – Sequential key timings by themselves:
Second Graph – Chosen key alongside sequential key timings:
Naturally, timings are rough, but they are accurate enough to paint the picture.
Actually, now that I’ve mentioned painting…
Have you heard of Schlemiel the Painter?
Schlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.
The next day Schlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.
The next day Schlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”
“I can’t help it,” says Schlemiel. “Every day I get farther and farther away from the paint can!”
(Credit: Joel Spolsky, 2001)
When looking at dynamic arrays in U2, you should see how they can be exactly like a computerised version of Schlemiel the Painter. In fact, a public article on PickWiki pointed this out quite some time ago. UniData is affected more so than UniVerse, in that UniVerse has an internal hint mechanism for attributes. The problem with this is, if an uncontrolled (e.g. external) entity has control over the number of items in the dynamic array, you could be vulnerable to a Denial of Service attack. It could even be unintentional.
So, let’s see what all the fuss is about. Firstly, a quick recap on the issue with dynamic arrays.
Essentially, when doing an operation like “CRT STRING<X,Y,Z>” it has to scan the string character by character, counting attribute, multi-value and sub-value marks as it goes. If you increment Y or Z (or X in UniData’s case) and do the same operation, it has to re-scan the string from the start all over again. As the number of elements increases, the flaw in this method becomes more noticeable. In fact, cycling through each element in this manner is an O(k^2) algorithm.
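You can see the quadratic growth by counting how many characters get scanned. The sketch below is Python, not UniBasic, and only models attributes (one delimiter level) with one-character elements, but the shape of the problem is the same.

```python
AM = "\xfe"  # attribute mark

def extract(dyn, n):
    """Return (attribute n, characters scanned), scanning from the start every time."""
    scanned = 0
    field = 1
    start = 0
    for i, ch in enumerate(dyn):
        scanned += 1
        if ch == AM:
            if field == n:
                return dyn[start:i], scanned
            field += 1
            start = i + 1
    if field == n:
        return dyn[start:], scanned
    return "", scanned

def total_scanned(count):
    """Characters scanned by a FOR loop that extracts attributes 1..count in turn."""
    dyn = AM.join("x" for _ in range(count))
    return sum(extract(dyn, n)[1] for n in range(1, count + 1))

print(total_scanned(100))  # 10099
print(total_scanned(200))  # 40199
```

Doubling the number of elements roughly quadruples the work, which is exactly Schlemiel walking back to the paint can.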
I’ve seen this issue bite and bite hard. It involved 2 slightly broken programs finding just the right (or wrong) timing.
The first program was a record lock monitoring program. It used the GETREADU() UniBasic command, after which it looped over every entry and generated a report on all locks over 10 minutes old. This process was automatically scheduled to run at regular intervals. It had been operating for months without issues.
The second program was a one-off update program. Basically, it read each record in a large file, locked it, then if certain complex conditions were met, it updated an attribute and moved on to the next record. See the problem? It didn’t release a record if it didn’t need updating. The processing was estimated to take about 30 minutes and, as it turns out, not many records met the complex conditions.
See the bigger problem now? Yup, that’s right, the dynamic array returned by GETREADU() was astronomical! This resulted in the monitoring program saturating a CPU core. The same core the update program was running on. Uh oh! System performance issues ensued until the culprit was found and dealt with.
So, what do we do about these issues? You want a stable system right? One that is less easy to bring to its knees by malicious users and unfortunate timings of buggy code?
DO NOT use external input as record keys! Place it in attribute 1, build a D-type dictionary and index it if you need, but do not use it as the @ID!
A further option would be to have hash files and their hashing algorithms updated to deal with this type of malicious edge case. Other languages (take Perl for example) have updated their hash tables so that the hashing algorithm is seeded at run-time. This means you cannot prepare ‘attack’ keys ahead of time and cannot replicate how the hashing works on another computer, since the hash algorithm will be seeded differently. Obviously, this cannot be done in exactly the same way with hash files, as they are a persistent data store. It could however be done on each CREATE.FILE. That way, even if a malicious party can determine the modulo of a file, they will not be able to duplicate it on their system as each file will be seeded differently. Doing this would bring UniData and UniVerse in line with the security improvements made in other modern stacks.
This one is simple. Use REMOVE, don’t use simple FOR loops. Think through your data and where it is being sourced from. Is it from external entities? Is it from internal entities whose behaviour cannot be guaranteed to remain within safe bounds? If the answer to either of those questions is even a ‘Maybe’, stay safe and use REMOVE.
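To be clear, U2 does not do this today; the following is only a sketch of what ‘seeded on each CREATE.FILE’ could look like. It is written in Python and uses a keyed BLAKE2 hash as the seeded hashing algorithm, which is my choice for illustration and not a suggestion of what the vendor would pick.

```python
import hashlib
import os

def new_file_seed():
    """Generated once when the file is created and stored in the file header."""
    return os.urandom(16)

def group_for(key, seed, modulo):
    """Group number for a record ID under this file's private seed."""
    digest = hashlib.blake2b(key.encode("utf-8"), key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulo

# Two files with the same modulo but different seeds place the same IDs differently,
# so keys prepared against one copy are useless against another.
seed_a = b"a" * 16
seed_b = b"b" * 16
ids = [f"user{n}" for n in range(100)]
layout_a = [group_for(i, seed_a, 4013) for i in ids]
layout_b = [group_for(i, seed_b, 4013) for i in ids]
print(layout_a == layout_b)  # False
```

Because the seed is stored with the file, lookups stay repeatable for the life of that file, which is what a persistent store needs. An attacker who only knows the modulo still cannot work out the grouping on their own machine.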
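For anyone unsure why REMOVE is safe where the FOR loop is not, the difference is that REMOVE remembers where it got up to. Here is a rough Python model of that behaviour. It is a simplification: it handles a single delimiter level, whereas the real REMOVE stops at every system delimiter and tells you which one it hit.

```python
AM = "\xfe"  # attribute mark

def remove_iter(dyn, mark=AM):
    """Yield each element in one left-to-right pass, keeping a position pointer."""
    pos = 0
    while True:
        nxt = dyn.find(mark, pos)  # the search resumes from pos, never from the start
        if nxt == -1:
            yield dyn[pos:]
            return
        yield dyn[pos:nxt]
        pos = nxt + 1

locks = AM.join(["lock1", "lock2", "", "lock4"])
for element in remove_iter(locks):
    print(repr(element))
```

Each character in the array is looked at once, so the total work grows in line with the size of the array instead of with its square.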