How to Back Up and Store Your Next-Generation Sequencing (NGS) Data


Ctrl + z, Ctrl + z, Ctrl + z, Ctrl + z, Ctrl + z, Ctrl + z, ………oh crap!

Congratulations!  You have recently received a file path to retrieve your hard-earned next-generation sequencing data.  You quickly transfer the files to the computing cluster you work on or perhaps, if you only have a few lanes of data, to your own computer.  But before you begin messing around with your data, you quickly realize that you should come up with a plan to back up and store unadulterated versions of your files.  This is a good decision for the following reasons, even if you are very experienced with the command line:
1.) It can be remarkably easy to delete your files (or the entire directory that they are stored in).
2.) When beginning to work with data, you could accidentally combine, misplace, rename, move, delete rows, reformat, or even overwrite files.
3.) The hard disk that the data are stored on could crash at any minute.
Now all of this may sound a little alarmist (and that is intentional), but with a few simple steps you can enjoy peace of mind while working with your data (or going to sleep at night).  Below, I will outline the steps that I have used when working with NGS data.

It is important to keep in mind that there are a number of ways to store and back up your data, and it would be great to hear different points of view below in the comments.  I use Linux Mint on my laptop and the bioinformatics cluster at Oregon State University runs CentOS.  If you use a Mac, the terminal commands shown here should work the same way, although macOS ships BSD rather than GNU versions of these tools, so some options differ.
Step 1:  Make and keep your data read-only.  In Linux it is quite simple to change the file permissions for any given folder or file.  When you have finished copying your files, I strongly recommend making the entire directory read-only.  Of course, your data may already be read-only, so the first step is to determine what your file permissions are set to (ls -l will show them).  If you do have write and execute permissions, you can change them with the chmod command.  You may be tempted to restrict access for other users only and to leave write and execute permissions for yourself.  I strongly discourage this practice, as the greatest threat to your own data is really yourself.  For a given file, chmod 0444 example.file should work.  Directories are different: they need the execute bit in order to be entered, so 0555 (or simply removing write permission with a-w) is the appropriate read-only setting for a directory.  Make sure you peruse the chmod manual page, as it is easy to lock yourself out by accident.  As the owner you can always restore permissions with chmod, but for files owned by another user you would need that user or root privileges (which you may not have) to regain access.  I suggest simply creating a temporary directory with some simple files and testing out your chmod choices (or use this tutorial).  Note that your data may arrive in a compressed format, such that you will need to change your permissions temporarily in order to uncompress the data.  I recommend that you do not do this until you have your first backups in place.
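Here is a minimal sketch of Step 1 that you can run safely, because it only touches a throwaway directory. The directory and file names (ngs_data, lane1, reads.fastq) are made up for illustration; substitute your own paths. It uses a-w rather than a numeric mode so that directories keep the execute bit they need.

```shell
# Hypothetical demo directory; substitute the path to your own data.
workdir=$(mktemp -d)
mkdir -p "$workdir/ngs_data/lane1"
echo "ACGT" > "$workdir/ngs_data/lane1/reads.fastq"

# Look at the current permissions before changing anything.
ls -lR "$workdir/ngs_data"

# Remove write permission for everyone, recursively.
# a-w leaves the read and execute bits untouched, so directories stay enterable.
chmod -R a-w "$workdir/ngs_data"

# Record the resulting permission strings (first 10 characters of ls -l).
file_perms=$(ls -l "$workdir/ngs_data/lane1/reads.fastq" | cut -c1-10)
dir_perms=$(ls -ld "$workdir/ngs_data/lane1" | cut -c1-10)
echo "file: $file_perms  directory: $dir_perms"

# When you later need to modify something (e.g., to uncompress a file),
# restore write access for yourself only, then clean up the demo.
chmod -R u+w "$workdir/ngs_data"
rm -rf "$workdir"
```

The exact permission strings you see depend on your umask, but neither should contain a w after the chmod step.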
Step 2:  Tar your files.  Tar stands for tape archive and is an incredibly useful tool.  Importantly, it preserves the directory structure of your folders along with file metadata such as permissions, ownership, and timestamps.  You can also compress your files at the same time, which can save a fair amount of storage space.  Before you start, definitely take a look through tar --help.  Here is a simple example of how tar works.  I have created a simple example directory (named “example”) with 5 files and 2 subdirectories with 1 file each.  Next I want to create a new archive, so I type:
tar -czvf example.tar.gz ./example
where c = create a new archive, z = use gzip to compress the archive, v = verbosely list the files being archived, f = the next argument is the archive file name, example.tar.gz = the file name of the new tar file (.tar.gz is the standard file extension for tarred and gzipped files), and ./example = the directory I want to archive.  Here I am issuing the command from the directory that contains the example folder; you could also use the absolute file path if you so desire.  To untar your file, simply use
tar -zxvf example.tar.gz
Here is a screen shot of the above example:
How to tar and un-tar an example directory.  After creating the archive, I intentionally delete the folder to illustrate how you would restore this folder if you lost your data.
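Before trusting a new archive, it may be worth checking that it actually contains what you expect. The sketch below is one way to do that, not the only one: it lists the archive contents without extracting them, then extracts into a scratch folder and compares the result to the original with diff. The file names and the restore_test folder are invented for the demo.

```shell
# Build a small throwaway directory tree to archive.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p example/sub1 example/sub2
echo "file A" > example/a.txt
echo "file B" > example/sub1/b.txt
echo "file C" > example/sub2/c.txt

# Create the compressed archive (same flags as above, minus the verbose output).
tar -czf example.tar.gz ./example

# t = list the contents of the archive without extracting anything.
tar -tzf example.tar.gz

# Extract into a scratch folder (-C changes into it first) and compare
# the restored copy to the original, file by file.
mkdir restore_test
tar -xzf example.tar.gz -C restore_test
if diff -r example restore_test/example; then
    verify_status="match"
else
    verify_status="differ"
fi
echo "archive check: $verify_status"
```

For real NGS data the extract-and-compare step can be slow, so listing the contents may be enough for a routine check.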

Step 3: Back up your files.  Now you have a nice archive of all of your files.  If space isn’t too limited, I would keep this .tar.gz in your working directory.  More importantly, however, you need to place at least one copy of this file on at least one additional, independent hard disk that is ideally not located in the same building.  We recently purchased a 5 TB RAID NAS (how’s that for acronyms!) and it has been very useful for this purpose.  There are lots of hardware choices out there, but in general you want to consider redundancy and ease of use.  If it is not easy to use, you will be less likely to use it.  If you don’t have too many files, several good-quality external hard drives designed for backing up data may also suffice, but the lack of redundancy seems a little troublesome.  I’d be curious to get other opinions on this subject, as there seems to be a wide variety of practices adopted by labs in this area.
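Whatever hardware you choose, it helps to confirm that the copy on the backup disk is identical to the original. Here is a rough sketch using a checksum; the backup_disk folder is a stand-in for wherever your NAS or external drive is mounted, and the archive is a dummy file. It assumes GNU sha256sum is available (on a Mac, shasum -a 256 plays the same role).

```shell
# Stand-ins for the working directory and the backup disk (hypothetical paths).
workdir=$(mktemp -d)
backup_dir="$workdir/backup_disk"
mkdir -p "$backup_dir"
echo "pretend this is a large archive" > "$workdir/example.tar.gz"

# Record a checksum next to the archive before copying it anywhere.
( cd "$workdir" && sha256sum example.tar.gz > example.tar.gz.sha256 )

# Copy the archive and its checksum file; -p preserves permissions and timestamps.
cp -p "$workdir/example.tar.gz" "$workdir/example.tar.gz.sha256" "$backup_dir/"

# On the backup side, recompute the checksum and compare it to the recorded one.
if ( cd "$backup_dir" && sha256sum -c example.tar.gz.sha256 ); then
    backup_status="verified"
else
    backup_status="mismatch"
fi
echo "backup copy: $backup_status"
```

Keeping the .sha256 file with each backup also lets you re-check the copy months later, which is when a silently failing disk tends to show itself.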
Step 4: Re-tar and back up your files regularly.  As you continue to work, you will undoubtedly (1) write scripts, (2) create intermediate files, and (3) create output files to be exported for data analysis.  In short, you will create lots of files and folders.  Efficient pipelines will strive to save important code, but will eliminate bulky intermediate files from long-term archiving.  One nice feature of tar is that you don’t need to re-tar files that you have already archived (i.e., you can create monthly addendums to your tar files).  Unfortunately, you cannot update a compressed archive, so you have to decompress it first.  Using the example from before, I have now added a new folder and a new file.  To update the archive I use:
gunzip example1.tar.gz
tar -uvf example1.tar ./example
And voilà, the archive has been updated.  Note that the archive is now an uncompressed .tar file; run gzip on it again if you want to recompress it.  Updating will save a lot of time and wasted computing resources by not continually re-archiving your huge NGS files.  Occasionally, it may be a good idea to create entirely new archives, particularly if you have added large amounts of new data.  I recommend updating archives monthly, or more often, depending on your level of activity.  Also, because most of your time is actually spent creating and editing scripts, I recommend creating additional backups of your scripts on your own computer, additional hard drives, Dropbox, GitHub, etc.
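The full update cycle, including the recompression step, might look something like the sketch below. It runs in a throwaway directory with invented file names (old.txt, new_folder/new.txt), so nothing here touches real data.

```shell
# Set up a throwaway directory and an initial compressed archive.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p example
echo "original data" > example/old.txt
tar -czf example.tar.gz ./example

# Simulate a month of work: a new folder and a new file appear.
mkdir -p example/new_folder
echo "new results" > example/new_folder/new.txt

# Decompress, update, and recompress.
gunzip example.tar.gz           # leaves an uncompressed example.tar
tar -uf example.tar ./example   # u = append files that are new or newer than the archived copy
gzip example.tar                # back to example.tar.gz

# Confirm that the new file made it into the archive.
listing=$(tar -tzf example.tar.gz)
echo "$listing"
```

Because -u appends rather than rewrites, an archive that is updated many times can accumulate several versions of a changed file, which is one more reason to build a fresh archive occasionally.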
Lastly, I am not sure if you would want to do this on a Mac (I think they come with their own backup software), but I use the exact same steps to back up all my documents, photos, music, etc. on my Linux laptop.  I then place these backups of my entire computer on Dropbox and modern external hard drives.  Now if only data analyses were so simple…
