cable.ayra.ch Help System

# Correctly working with files and directories

If you work with files and directories, you can do a lot of things wrong. Here's a set of things you should consider when implementing file system routines.

*Code samples are given in C#*

## Checking if a file exists before opening it

This is something everyone does at first. It feels right, but it is wrong. Consider this piece of code:

```c#
public string GetConfigText(string ConfigName)
{
    if (ConfigName == null)
    {
        //Pass the parameter name, not its (null) value
        throw new ArgumentNullException(nameof(ConfigName));
    }
    if (File.Exists(ConfigName))
    {
        return File.ReadAllText(ConfigName);
    }
    return GetDefaultConfigText();
}
```

This seems like a reasonable function:

- Make sure the supplied config file name is not null
- Check if the config file exists
- Return the config file text if the file exists
- Return the default configuration if the file doesn't exist

However, there are a few problems with this:

- `File.Exists` returns `false` if the file exists and you lack permissions to access it
- The file might be deleted/moved/renamed by another application between your calls to `Exists` and `ReadAllText`
- The file might be in use and locked by another application
- File system errors can occur at any time (for example if a drive is removed)

The fix for this is simple. Instead of trying to avoid errors by checking if a file exists, you should catch and deal with possible errors:

```c#
public static string SafelyGetConfigText(string ConfigName)
{
    try
    {
        return File.ReadAllText(ConfigName);
    }
    catch (FileNotFoundException) //File not found (Maybe deleted by user)
    {
        return GetDefaultConfigText();
    }
    catch (DirectoryNotFoundException) //Path segment not found (Maybe first run)
    {
        return GetDefaultConfigText();
    }
    catch (IOException) //Other file system error (for example file in use)
    {
        throw; //Instead of this, show error to user, ideally with a Retry option.
    }
    //All other errors are thrown
}
```

The level of error handling is obviously up to you, and not all languages provide the same level of detail in their error handling capabilities.

## Checking file size before reading

This has the same problem as the existence check in the chapter above and should be avoided. Open the file and use the provided functions/properties to check the length of the stream/handle. In some languages you might need to seek to the end of the file for this to work, and some languages can't report the size for all file opening modes. In ASCII mode, the reported size might be wrong.

## Trying to be smart when fixing errors

Don't try to be overly smart when you encounter problems with file operations. Sure, trying to find out which application holds a lock on a file you want to edit might be an interesting thing to implement, but most such workarounds potentially introduce other problems and are usually not worth the time to implement.

## Buffering

You should avoid writing single bytes to files repeatedly because system calls are expensive, but you should also avoid caching a huge amount of data in memory before flushing it to disk. File system buffers have existed for decades now and are pretty solid. If possible, you should try to write a file in valid chunks, so that when your application dies for some reason, the file you have been writing to is in a valid state, or at least valid enough to be recovered. For a log file this would mean constructing the entire line in memory and writing it at once.

## Seeking

Seeking in a file stream will likely discard the read and write buffers. Any pending writes will still occur properly, but seek operations may take a long time if a lot of data is waiting to be written. Take this into account when you create a custom file format. The biggest problem is alternating between seek and read/write.
So instead of `...` it's better to do `......` or `......`.
Either one will allow you to read all properties in one go, and then seek to the data you need instead of alternating between reading properties and seeking over data until you reach the segment you're interested in. This is why rendering an index of a tar file with many files can be time consuming, while for a zip archive, it's fast.

## Sparse files

Set the file size if you know in advance how big your file is going to be. Some programming languages have a way to set file sizes. In C#, you can use `Stream.SetLength(long)`. This avoids fragmentation and ensures that there is enough space for your file. These functions usually allow you to trim the end off a file too.

There is no universal guarantee that this operation is fast. Windows, for example, guarantees that the contents of sparse files consist of null bytes. The NTFS file system has the ability to declare sparse files, so there the operation will be very fast. FAT and FAT32 lack this ability. On these file systems, if you seek 10 million bytes beyond the end of the file, Windows will write 10 million null bytes.

## Sync vs Async

If you perform file system operations, there's always a chance that the file system is busy and your call needs to wait. This can be caused by something as simple as a disk needing to spin up. In general you want to use asynchronous file I/O if it's available. This can be done either with the async/await model, or by using synchronous I/O on a separate thread. Threads can be suspended, which gives you an easy mechanism to pause and resume operations.

## File handle lifetime

Don't repeatedly use functions that open a file, write a line and close the file. Instead, open the file handle, write all lines, then close it. Opening files is quite an expensive operation.

## Binary vs. ASCII mode

This one is simple. When you read and write files, do not open them in ASCII mode unless you don't mind that the file contents are not exactly what you wrote into the file.
This issue does not apply to all programming languages. Some languages (C and PHP for example) allow you to specify whether you want to open a file in binary or ASCII mode. ASCII mode will convert LF into CRLF when you write the file. It might also stop reading when it encounters the byte 0x1A (CTRL+Z).

### About CTRL+Z

CTRL+Z termination of files comes from a time when files had no size property. A file occupies a certain number of allocation units in the file system. These units have a fixed size. Often it's 512 bytes, but 4096 is getting more common with larger disks. When reading files in ASCII mode, the reader would stop if the byte 0x1A was encountered. A file would normally be padded with this byte up to the end of the block. Having a way to end files earlier was important because you could not reliably concatenate files otherwise.

Somehow this behavior is still present in some utilities. An example is the DOS and Windows COPY command, which operates in ASCII mode by default when concatenating files.

## Assuming a file is (not) locked

If your application is intended to run on different operating systems, you should not make assumptions about file locking. You should always lock a file to avoid interference with other applications. In general, this means doing the things below unless your application is prepared to deal with the consequences.

- Disallow concurrent writes
- Disallow removal of files/directories that are in use
- Disallow renaming/moving of files/directories that are in use
- Disallow reads when writing

There are legitimate reasons to allow certain actions. For example, allowing other applications to read a file while you write to it is useful for log files.

## Copying file properties

When copying a file verbatim, you should copy the modification time too.
The common timestamps are as follows (not all of them are available on every file system and operating system):

- Access: Time the contents (not properties) of the file were accessed last
- Create: Time this specific file system entry was created
- Modify: Time the contents were modified

Note that when you copy files, you do not copy the time of creation, only the time of modification. This means that the creation time of a file can be after the modification time. This is no mistake.

## Moving files

In most file systems, moving a file is technically the same as renaming it. This is an atomic operation. Moving a file from one file system to another, on the other hand, is not. Moving a file between file systems involves:

1. Creating the destination file
2. Setting file properties of the destination
3. Copying the contents from the source to the destination
4. Deleting the source

There's a lot that can go wrong here (not exhaustive):

- The destination might already contain a file with this name (Step 1 fails)
- The destination might be write protected (Step 1 fails)
- The destination might be full (Step 3 fails)
- The destination might not support the same set of properties as the source (Step 2 fails)
- The destination might be unable to handle the file name (Step 1 fails)
- The destination might be unable to store a file this large (Step 3 fails)
- The destination might not support the type you copy (Step 1 fails, for example symbolic links)
- The source might be write protected (Step 4 fails)
- The source file might be in use (Step 3 or 4 fails)

You don't have to deal with every problem individually, but be sure you can deal with a failure in every step and keep your operations consistent. For example, if step 4 fails, the user can still decide to keep the destination file.

If you try to cross file systems, a move operation might take a long time, or fail outright, depending on your language.
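
The four steps above can be sketched in C# roughly as follows. This is a minimal sketch, not a definitive implementation: the class `MoveExample` and the method `MoveAcrossFileSystems` are hypothetical names chosen for illustration, and only the modification time is carried over as a file property. The method cleans up a partially written destination if the copy fails, and reports a failed step 4 to the caller instead of hiding it.

```c#
using System;
using System.IO;

public static class MoveExample
{
    //Hypothetical helper: moves a file by copying it and then deleting the source.
    //Returns true if the source was removed, false if the copy succeeded
    //but the source could not be deleted (step 4 failed).
    public static bool MoveAcrossFileSystems(string source, string destination)
    {
        bool created = false;
        try
        {
            using (FileStream input = new FileStream(source, FileMode.Open, FileAccess.Read, FileShare.Read))
            //Step 1: CreateNew throws an IOException if the destination already exists,
            //so an existing file is never overwritten
            using (FileStream output = new FileStream(destination, FileMode.CreateNew, FileAccess.Write, FileShare.None))
            {
                created = true;
                //Step 3: copy the contents
                input.CopyTo(output);
            }
            //Step 2: copy the modification time (done after the streams are closed)
            File.SetLastWriteTimeUtc(destination, File.GetLastWriteTimeUtc(source));
        }
        catch
        {
            //Only remove the destination if this call created it
            if (created)
            {
                try { File.Delete(destination); }
                catch (IOException) { /*Leave the partial file behind*/ }
                catch (UnauthorizedAccessException) { /*Leave the partial file behind*/ }
            }
            throw;
        }

        try
        {
            //Step 4: delete the source
            File.Delete(source);
            return true;
        }
        catch (IOException) //Source in use
        {
            return false;
        }
        catch (UnauthorizedAccessException) //Source write protected
        {
            return false;
        }
    }
}
```

A return value of `false` leaves both files in place, so the application can ask the user whether to keep the destination file.
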
### Permissions

Permissions should not be copied from one file to another unless there are good reasons (backups for example). A file should inherit permissions according to what is defined at the destination. In general you don't want to change permissions unless you know what you are doing.

## Path strings and file names

Not all file systems and operating systems can accommodate the same path and file name strings. Unless really needed, you should avoid these:

- File names in a directory that differ only in upper- and lowercase
- File names consisting of only dots
- File names containing characters with special meaning on the command line, for example the ampersand
- File names that can't be typed on the keyboard
- Very long file names and path strings

## Custom file extensions

Custom file extensions should generally be avoided. Only use custom extensions if you actually invented a new file format. You can use custom extensions to keep users from opening the file by accident, but this is no security measure. Someone who wants to know will find out that your ".enc" file is really just a base64 encoded json.

## File system structure

If you recursively search through a file system, you should ignore links and junctions. These can be used to construct cyclic structures or create duplicates. Unless you know how to handle these problems, it's best to ignore anything other than real files and directories.

## Many files

Every entry in a directory is going to slightly slow down some operations. For example, if you create a file, the system has to run through the entire index of the directory to check for name conflicts. If you can avoid it, don't put too many files into a single directory. Good strategies to avoid this are to combine files into one, or to create more directory levels.
You can for example use the SHA1 function to create file names, and then sort them into two levels of directories, so instead of `test\FC35B10C78D44D4A3E61A5B3326CCC33B7189087` you do `test\F\C\FC35B10C78D44D4A3E61A5B3326CCC33B7189087`

## Files as temporary data exchange

Files are not the ideal mechanism to pass data from one application to another. For that, you should use another method. Methods available on pretty much all systems include piping data via the command line, and network sockets.

## Files as scratch space

Do not create temporary files for the sole purpose of passing data from one library to another, unless you want to pass so much data that you risk running out of RAM. Use something like `System.IO.MemoryStream` instead. Some library functions that accept file names as arguments have an alternative version that accepts stream resources/handles as arguments. You usually want to use those functions to pass data around and avoid using the file system as a temporary buffer.

## Temporary file location

Temporary files should only be created below the user's temp directory. To avoid conflicts, your application should not create files directly in the temp directory, but create its own folder where files are kept. Your application should also periodically delete old files from that folder. Finally, do not assume that a file is still in the temp directory between two launches of your application.

## `/` vs `\`

It's no secret that Windows uses `\` while most other systems use `/` as directory separator. While support for `/` has increased in Windows, it's still not the official separator and some functions will fail if you use it. If your language is portable across systems, it should provide functions or constants with the proper slash, or it might silently replace `/` with `\` for you in its file system functions. In general, do not use string concatenation functions to build path strings.
Always use the file system functions provided by your language.
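
In C#, the function for this is `System.IO.Path.Combine`. The sketch below shows it in use; the class `PathExample`, the method `GetConfigPath` and its parameter names are made up for illustration.

```c#
using System;
using System.IO;

public static class PathExample
{
    //Hypothetical helper: builds the path of a file below an application directory
    public static string GetConfigPath(string baseDirectory, string appName, string fileName)
    {
        //Path.Combine inserts the directory separator of the current platform,
        //so no "/" or "\" appears in the code
        return Path.Combine(baseDirectory, appName, fileName);
    }
}
```

A call could look like `PathExample.GetConfigPath(Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData), "MyApp", "config.json")`, where `MyApp` and `config.json` are placeholder names. Be aware that `Path.Combine` discards all earlier segments if a later segment is a rooted path, so segments that come from user input should be validated first.
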