
Hints for Graduate Students

On reproducibility, collaboration, and credible data analysis

My best advice, and the best advice of others, about this was published in The Political Methodologist a couple of years ago in Six steps to a better relationship with your future self.

Concerns about the lack of good lab hygiene in the social sciences are only growing: more and more journals require that you submit code and data before they publish your article. See, for example, this new organization. Even though graduate school will be a very busy time, it is worth investing time in learning good research practices now.

On Tools

If you plan to do much statistical computing, I highly recommend Unix-based hardware. I use Apple hardware these days because I wanted a closed system that I couldn't break (in the way I have long spent hours breaking Linux systems). You will learn a lot more about computing if you install and maintain your own Linux distribution. You may get more academic work done if you get a Mac or can exert the self-control required to treat a nice Ubuntu/Fedora/(other preconfigured, ready to run, widespread Linux distribution) as a Mac. That is, you may get more academic work done if you do not recompile your Linux kernel.


Once in a long while Jeff Gill or I update a text document which we call "How to turn your Mac into a Scientific Workstation". Now that we have it on GitHub, we encourage others to fork it and submit pull requests.


Text Editor Update: The advent of relatively cheap massively parallel computing (like Amazon's EC2 system, or even more cheaply, the free clusters on your campus) has led me back to Vim from Emacs. If you anticipate that you will be spinning up many cores for some of your computing jobs, then you might invest some time in a text editor that does not require a mouse or GUI, because you will find yourself spending a lot of time on the command line in a terminal window connected to remote machines. I currently use MacVim (installed via brew) and the Terminal on my Apple laptop, and then vim and tmux on whatever remote Unix system I am using to run jobs. [I know that you can use Emacs without a mouse. I am just more comfortable with Vi without a mouse.]

Version Control

I used to use Subversion. Now I use Git, and especially the GitHub infrastructure, for managing projects. GitHub offers some free versions of paid accounts for academics if you ask nicely. When I have needed more private repositories than those available on GitHub, I have also used BitBucket.

Working with Subversion

Checking Status and Updating Multiple Working Copies

Say you have a bunch of directories, each of which contains a working copy of a different repository. When you sit down to work at this machine you might want to ensure that all of your working copies are updated. Rather than doing this by hand, you can go to the parent directory of all of the working copies (say, "~/PROJECTS") and use this bash one-liner:

for X in `find . -maxdepth 2 -type d -name ".svn" | cut -d / -f 2`; do svn update ./$X; done

Shell script novices can note that this one-liner first uses find to locate all of the subdirectories that are Subversion working copies and then uses cut to return a list of their directory names. The for X in ...; do ...; done part executes the commands between the do and the done once for each element of that list, with X set to the current element. The list is the text returned by surrounding the find ... | cut ... piece with backticks (`).
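A slightly more defensive variant of the same idea, as a sketch: the backtick version splits its list on whitespace, so a working copy directory with a space in its name would break it. The function below uses a shell glob instead of find and cut. The names update_working_copies and SVN_CMD are hypothetical, chosen here for illustration; neither is part of Subversion.

```shell
#!/bin/sh
# Sketch: run "svn update" in every working copy directly below a parent directory.
# update_working_copies and SVN_CMD are hypothetical names, not part of Subversion.
update_working_copies() {
    parent="${1:-.}"                      # default to the current directory
    for meta in "$parent"/*/.svn; do      # one match per working copy
        [ -d "$meta" ] || continue        # skip the unexpanded pattern if nothing matched
        # ${meta%/.svn} strips the trailing "/.svn", leaving the working copy path
        ${SVN_CMD:-svn} update "${meta%/.svn}"
    done
}
```

Calling update_working_copies ~/PROJECTS would then update each working copy in turn, and prefixing the call with SVN_CMD=echo prints the commands instead of running them, which is a cheap way to check what the loop will touch.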

Migrating from One Server to Another

If your Subversion repository happened to be running on a 6 year old Linux box built from spare parts, some of which are more than 6 years old, and this machine just so happens to fail, how can you re-instantiate your repository on a new machine?

First Hope That You Have Backed Up the Repository

Here is an example cron.daily script that I used to back up the repositories on my old bad server and to transfer the dumped files to another machine via ssh:
  
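#! /bin/sh

# Dump every repository under /home/svn/repos to /TEMP/SVNDUMP/<name>dump.
cd /home/svn/repos || exit 1
for X in *; do svnadmin dump "$X" > "/TEMP/SVNDUMP/${X}dump"; done
# Copy the dump files to another machine over ssh.
rsync -auvzr -e ssh /TEMP/SVNDUMP [email protected]:~/
cd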
  

Next Create a New Directory Structure for Your Repositories

TODO

Create New Repositories

TODO

Load the Backups into the New Repositories

TODO
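One way the three steps above might go, as a sketch rather than a tested recipe: it assumes the dump files were produced by the cron script above (so each one is named after its repository with "dump" appended) and have already been copied to the new machine. svnadmin create and svnadmin load are standard Subversion commands; the function name restore_repos and the SVNADMIN_CMD override are hypothetical.

```shell
#!/bin/sh
# Sketch: recreate one repository per dump file and load its history.
# restore_repos and SVNADMIN_CMD are hypothetical names, not part of Subversion.
restore_repos() {
    dump_dir="$1"     # directory holding the <name>dump files
    repo_root="$2"    # new parent directory for the repositories
    mkdir -p "$repo_root" || return 1
    for dump in "$dump_dir"/*dump; do
        [ -f "$dump" ] || continue            # nothing matched the pattern
        name=$(basename "$dump")
        name="${name%dump}"                   # BIBdump -> BIB
        # create an empty repository, then replay the dumped revisions into it
        ${SVNADMIN_CMD:-svnadmin} create "$repo_root/$name" &&
            ${SVNADMIN_CMD:-svnadmin} load "$repo_root/$name" < "$dump"
    done
}
```

For example, if the dumps ended up in ~/SVNDUMP on the new machine, restore_repos ~/SVNDUMP /Users/svn/Repos would produce the repository paths used in the client instructions below. Setting SVNADMIN_CMD=echo prints the create and load commands without running them.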

Now you are ready to go.

For Clients

Once the new Subversion server is running and has loaded the backup files, you need to "switch" the repository to which your working copies point by using the following svn switch command:

cd WorkingCopy
svn info #find the name of the old repository by looking at the URL field and use it below.
svn switch --relocate svn+ssh://oldbadserver.edu/home/svn/repos/WorkingCopy \
                      svn+ssh://newgoodserver.edu/Users/svn/Repos/WorkingCopy
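
If you have many working copies to switch, typing both URLs each time gets tedious. As a sketch, the helper below rewrites an old URL into its new counterpart by swapping the repository root. relocate_url is a hypothetical name, not an svn subcommand, and the server paths in the usage note are the ones from the example above.

```shell
#!/bin/sh
# Sketch: map a URL on the old server to the matching URL on the new one.
# relocate_url is a hypothetical helper, not an svn subcommand.
relocate_url() {
    url="$1"; old_root="$2"; new_root="$3"
    case "$url" in
        "$old_root"/*)
            # drop the old root and graft the remainder onto the new root
            rest=${url#"$old_root"/}
            printf '%s\n' "$new_root/$rest" ;;
        *)
            # not under the old root: leave the URL untouched
            printf '%s\n' "$url" ;;
    esac
}
```

Called as relocate_url svn+ssh://oldbadserver.edu/home/svn/repos/BIB svn+ssh://oldbadserver.edu/home/svn/repos svn+ssh://newgoodserver.edu/Users/svn/Repos, it prints svn+ssh://newgoodserver.edu/Users/svn/Repos/BIB. Inside a working copy, the old URL could be read from the URL field of svn info (assuming English-language output) and both values handed to svn switch --relocate.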

Then you'll want to run svn status -u within each of the Working Copy directories to make sure the switch worked out:

for X in `find . -maxdepth 2 -type d -name ".svn" | cut -d / -f 2`; do svn status -u ./$X; done

And if it looks OK, you can get back to work.

Troubleshooting

Sometimes, if you've changed a bunch of stuff in the Working Copy while the new server was being repopulated, or the server failed after a commit from you but before cron.daily could run, you'll get an error message about "no such revision" or something similar. This means that you have to check out a new copy of the repository, add the changes from your (assumed newer) Working Copy to it by hand, and then commit the new working copy with your changes. Say I had a problem with my BIB directory; the workflow would go something like this:


mv BIB oldBIB
svn co file:///Users/svn/Repos/BIB BIB
diff -ruNp --exclude=.svn BIB oldBIB > file.patch
##(then look at the file.patch to see what the differences were)
patch -p0 -u < file.patch

Then you can run svn status -u to review the result and svn commit to upload the changes recovered from the old Working Copy.