Home
A Question of Sorts
Sorting Files

Sorting Files

To get something up and running for the specific need of our client we knew we could rely upon all of the records being of fixed and equal length and on the fact that we only wanted to sort the file based upon a few bites at a specific location within each record. So for a first draft we resolved to read through the input file extracting the sort "key" from each record and storing it in a variant array. We then sorted the array in memory and then created a new output file with all of the records in the new sorted order. The code was written as a DLL to make it readily available for re-use. This produced the tool we needed for the job in hand but then, of course, we started to think about very large files or very large keys. The sort routine was rewritten to work against a work file on disk rather than an array in memory and we then had a sort routine with a very low memory overhead and the capacity to sort large disk based files very quickly.

So far everything was based upon "ascending keys" and fixed length files. An effective utility needed to be able to sort not just multiple keys but allow for some keys to be "ascending" and others to be descending. We also needed to provide support for variable length records. Now it would be nice to have an ActiveX EXE and provide a command line interface as well. You start off with a simple need and you end up writing a mound of code - still it must be seen as an investment I suppose.

One pain about this whole process is that you do not know what file type you are dealing with. It would be helpful to know if the file was written as a Binary file (or Random Access) or if it is a text file with or without variable length records. We could think of ways to guess that a file was a text file but could see no way of identifying the record length of a binary type file without the user (or the calling software module) telling us.

You can download the full source of our solution from the downloads page but do read the overview below to get some idea of how we tackled the problems.

The input file is read, record by record. The sort key(s) are extracted from the records and stored in a binary work file along with the record number.

The software allows the user to specify multiple sort keys by providing a start position in each record, a key length and an indicator to say if the key is ascending (A, B, C … etc) or descending (Z, Y, X… etc).

The sort work file is sorted using the same techniques as the variant array in the sort algorithm code samples found elsewhere in this cookbook. Each record (sort key and record number) is treated as an array element and the "subscript" is the position of the record in the file.

After being sorted the work file of keys is read in sequence – actually ignoring the keys now and just reading the record numbers. The record numbers are used to read the input records from their location in the input file and they are written out to the output file in the sequence of the sorted keys.

The sort work file and the output file are both temporary files ( please see the cookbook section on creating temporary files) and the final action of the sort process is to rename the output file to the file name requested by the user.

This works just fine for files with equal record lengths. Sequential type files (typically text files) often have variable length records. Reading in those variable length records is trivial and the act of extracting the keys and sorting them identical to other file types. The slight ‘hiccup’ came when tackling the problem of reading the file records in the sorted order when the time came to write them out again. What we were looking for was a method of moving the "file pointer" to the correct position in the file and unfortunately such files do not have a facility to directly influence the file pointer. A quick look at the relevant Windows calls showed that we were going to have to keep a track of each record position in order to move the file pointer that way so… The solution was another binary sort work file (temporary of course) that stored the position and length of each record. The (text) input file was then reopened as a binary file and the additional sort work file used to locate and read in each record as it was required to create the sorted output file. Simple, and quick too!

This code module will shortly be available for download under an Open Source license. It is will also be available as a DLL and with a user interface for non programmers. Please contact DevTeam@adit.co.uk if you can’t wait for us to post our new pages.

Google
 
Web www.adit.co.uk
www.aditsite.co.uk