Microsoft NTFS Circular Reference Hotfix
Microsoft has identified an NTFS corruption issue that can cause a Windows 2008 server to freeze or hang. The specific corruption that causes the problem is a multi-level circular NTFS reference that Windows self-healing cannot fix. The OS hang caused by the corruption is fixed in Windows 2012, and Microsoft has also released a hotfix for Windows 2008. See the Microsoft article "Application freezes when an NTFS volume contains a circular directory reference in Windows 7 or Windows Server 2008 R2" for more information or to download the hotfix.
Note that this hotfix neither prevents the corruption nor repairs it. The hotfix only prevents the OS hang and marks the file system as dirty. Once the file system has been marked dirty, an automatic chkdsk runs the next time the server is rebooted and permanently fixes the problem.
This bug affects ExtremeZ-IP more than other services because of the way the Mac, and therefore the AFP protocol, deals with files. Internally the Mac does not use file names or paths to keep track of files. Instead it uses a 32-bit numeric file ID to refer to each file, which is what allows Mac aliases to continue to work even when the original file is moved. ExtremeZ-IP maintains a mapping stream, which is a table that maps the Mac 32-bit file IDs to the native Windows 64-bit NTFS file IDs. When the Mac makes a request to open a specific file ID, ExtremeZ-IP translates the request into an NTNative command that opens the file using the NTFS ID.
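The mapping described above can be pictured as a simple two-way lookup table. The following is only an illustrative sketch, not ExtremeZ-IP's actual mapping stream: the class name FileIdMap, its method names, and the sequential allocation of Mac IDs are all assumptions made for the example.

```python
class FileIdMap:
    """Toy two-way table between 32-bit Mac file IDs and 64-bit NTFS file IDs."""

    MAX_MAC_ID = 2**32 - 1
    MAX_NTFS_ID = 2**64 - 1

    def __init__(self):
        self._mac_to_ntfs = {}
        self._ntfs_to_mac = {}
        self._next_mac_id = 1  # assumption: Mac IDs are handed out sequentially

    def mac_id_for(self, ntfs_id):
        """Return the Mac ID for an NTFS ID, allocating one on first use."""
        if not 0 <= ntfs_id <= self.MAX_NTFS_ID:
            raise ValueError("NTFS file ID must fit in 64 bits")
        if ntfs_id in self._ntfs_to_mac:
            return self._ntfs_to_mac[ntfs_id]
        if self._next_mac_id > self.MAX_MAC_ID:
            raise OverflowError("no 32-bit Mac file IDs left")
        mac_id = self._next_mac_id
        self._next_mac_id += 1
        self._mac_to_ntfs[mac_id] = ntfs_id
        self._ntfs_to_mac[ntfs_id] = mac_id
        return mac_id

    def ntfs_id_for(self, mac_id):
        """Translate a Mac ID from a client request back to the NTFS ID."""
        return self._mac_to_ntfs[mac_id]
```

In this sketch a client request that names a Mac file ID is resolved with ntfs_id_for, and the resulting NTFS ID is what would be handed to the OS to open the file. That open-by-ID step is where a circular parent reference would be encountered.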
File IDs are not used only by Finder aliases. Most Mac applications also use aliases internally to keep track of their open files, rather than working directly with file paths. Most Windows applications and services, on the other hand, open files by path and not by ID. That is the main reason ExtremeZ-IP is much more likely to encounter the bug than many of the other higher-level services. Other services that need low-level file access, such as defragmenters and anti-virus scanners, are also likely to encounter the same problem. In general, however, those services do not access the file system nearly as frequently, so ExtremeZ-IP is likely to hit the corruption first.
The corruption consists of an NTFS ID whose chain of parent IDs leads back to that same ID. If it is a single-level circular reference (NTFS ID xxxx has a parent ID that is also xxxx), Windows 2008 self-healing has always been able to fix it automatically. If, on the other hand, it is more deeply nested, such as a->b->c->a, then without the hotfix the kernel could go into an infinite loop and the server would hang.
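To make the two cases concrete, here is a minimal sketch that walks parent links and reports a loop if one exists. It is not how NTFS or the hotfix works internally. The function name find_parent_cycle and the dictionary-based model of parent IDs are assumptions for illustration only.

```python
def find_parent_cycle(parent_of, start_id):
    """Follow parent links from start_id.

    parent_of maps an ID to its parent ID (None for the root).
    Returns the list of IDs that form a loop, or None if the walk
    reaches the root without revisiting any ID.
    """
    position = {}  # ID -> index in path, so the loop can be sliced out
    path = []
    current = start_id
    while current is not None:
        if current in position:
            return path[position[current]:]
        position[current] = len(path)
        path.append(current)
        current = parent_of.get(current)
    return None


# Hypothetical IDs, chosen only for the example.
healthy = {5: None, 10: 5, 11: 10}        # 11 -> 10 -> 5 (root)
single_level = {20: 20}                   # xxxx has parent xxxx
multi_level = {30: 31, 31: 32, 32: 30}    # a -> b -> c -> a
```

A walk that does not remember which IDs it has already visited would never terminate on the single_level or multi_level tables, which is the user-space analogue of the kernel loop described above.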
ExtremeZ-IP hangs because one of its threads asks the OS to open a file that has a circular reference. Once that happens, the thread goes into an infinite loop in the kernel. ExtremeZ-IP has a check that attempts to cancel any thread that takes more than 300 seconds (5 minutes) to return, but by the time it tries to kill the stuck thread it is too late. The other ExtremeZ-IP threads that need to do kernel tasks are all backed up behind that single thread, and the service cannot get enough work done even to shut down the stuck thread. More information about ExtremeZ-IP stalled thread handling can be found in the following article: http://support.grouplogic.com/?p=3787.
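The situation can be illustrated with a small user-space analogy. This is a sketch under stated assumptions, not ExtremeZ-IP code: a single-worker thread pool stands in for the resource the other threads queue behind, a short timeout stands in for the 300-second limit, and the stuck call waits on an event so that the demo can end, which a real kernel loop would not allow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

STALL_TIMEOUT_SECONDS = 0.2  # stands in for the 300-second limit


def demo_stalled_worker():
    release = threading.Event()  # escape hatch so the demo can finish
    with ThreadPoolExecutor(max_workers=1) as pool:
        stuck = pool.submit(release.wait)     # the call that does not return
        queued = pool.submit(lambda: "done")  # work backed up behind it
        try:
            stuck.result(timeout=STALL_TIMEOUT_SECONDS)
            timed_out = False
        except FutureTimeout:
            timed_out = True
        # A call that is already running cannot be cancelled.
        cancelled = stuck.cancel()
        # The queued work has not been able to start, let alone finish.
        queued_finished = queued.done()
        release.set()  # only now can the backlog drain
        queued_result = queued.result(timeout=5)
    return timed_out, cancelled, queued_finished, queued_result
```

Calling demo_stalled_worker() shows the watchdog noticing the stall, the cancel attempt failing, and the queued work remaining blocked until the stuck call is released. In the real failure there is no release, so the backlog never drains.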