27/07/2011

Making a bibtex file from a folder of pdf files

The issue
As I'm going to be writing some big documents with lots of references, I'd be a fool to try to manage these manually, so I needed to pick a piece of reference management software. After some browsing I settled on JabRef because it's free, open source, lightweight and cross-platform, and it handles bibtex format natively (which is what I need for it to integrate with latex). It should also link nicely into the Sciplore mind mapping software which I'm using (more about that some other time).

JabRef is basically a database management tool for references that stores its database in bibtex format. It looks like it will work rather well, but unfortunately my first stumbling block is that I already have a folder full of my references in pdf format (~200). This means that I'm immediately faced with the big task of going through and adding the details of each pdf individually. There must be a better way...

Someone else asked the same question here. The answer seemed to be that there was no easy way in JabRef, but that it could be done in some other reference management software, such as Mendeley. So I could install that as well and export from there to JabRef. That seemed like a pain though, especially as you need login details and all sorts for Mendeley.

The solution
Somewhere else cb2Bib was suggested. This looks like an awesome piece of software, almost to the point that I could use it instead of JabRef, although I don't think it does quite the same job. It's designed as a bibtex database manager, but it is more tailored towards reference entry than editing or final use (e.g. citations), although it can do these too. Its method of adding a new reference is based on what's currently in the clipboard, that is, whatever you most recently 'cut' or 'copied' in your operating system. This can either be a piece of text or a pdf file.

Files from the system can also be queued up to be added to the clipboard for addition to the bibtex database - in this manner a folder's worth of pdf files can be added. Once the file is in the clipboard the software interrogates it to try to extract the right details for the bibtex reference entry. It is also able to do some other clever things, like searching the web to find a matching web reference using only one of the pieces of data it has extracted. There is also the option to manually edit the fields or to set off a whole run of files to add automatically.

My implementation
In practice the software took a little while to get used to; the buttons aren't in quite the locations I'd expect, there seem to be about 3 different windows that are independent but interrelated and the method of specifying a bibtex file and then successively saving additions to it felt a little odd (rather than running through to create a file and then saving it all at once). But once I was used to it at that level it all worked.

When I came to actually try to add all of my pre-saved pdfs however, I hit problems. Whilst automatic extraction usually managed to pull out a few nuggets of useful data, it rarely found enough for a complete entry. Hitting the button to search the web didn't seem to give much assistance. So it was time to dig a little deeper.

Probing through the website, there is quite a lot of useful information on how to configure the software to do what you want. What I needed to do was look into which sites were being searched on the web for my articles. This is all set up in a configuration file located at:
C:\Program Files\cb2bib\data\netqinf.txt (windows)
or
/usr/share/cb2bib/data/netqinf.txt (linux) (you'll need permissions or to be root to edit)

Wading into there you can find out which sites are searched and in what order. What would have been ideal for me would have been a search of the IEEE Xplore site, as that would have turned up most of my papers. Unfortunately it was not in there. Second best was Google Scholar, sitting at the bottom of the list of options. The documentation in the file wasn't brilliant, but with a bit of trial and error I was able to work out what was going on.

The major change I made to the file was to add this at the top of the queries list:

# QUERY INFO FOR Google Scholar
journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<title>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=


journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<excerpt>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=

The important changes here are the <<title>> and <<excerpt>> search strings, and the change from capture_from_query=info:(\w+):scholar in the existing scholar searches to capture_from_query=info:(.+):scholar in my search. I'm not too sure what the latter change did, but its effect was that it found the details - where previously it was often missing them!
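
The difference between the two capture patterns can be illustrated with a short Python sketch. The result-page snippets and IDs below are invented for illustration, and cb2bib uses its own regular expression engine rather than Python's, so this only shows the general behaviour of \w+ versus .+ and is not a confirmed explanation of what happened inside cb2bib. The point is that \w only matches letters, digits and underscores, so an ID containing any other character (a hyphen, say) would be missed by the stricter pattern.

```python
import re

# Hypothetical fragments of a results page; these IDs are made up.
plain_id = "related:info:AbC123xyz_J:scholar.google.com/"
hyphen_id = "related:info:Ab-C123xyzJ:scholar.google.com/"

strict = re.compile(r"info:(\w+):scholar")  # pattern in the existing entries
loose = re.compile(r"info:(.+):scholar")    # pattern used in the new entries


def capture(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(capture(strict, plain_id))   # AbC123xyz_J
print(capture(strict, hyphen_id))  # None
print(capture(loose, hyphen_id))   # Ab-C123xyzJ
```

One caveat with .+ is that it is greedy, so on a line containing several "info:...:scholar" fragments it would capture everything up to the last one.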

The other change I made was to untick the "Set 'title' in double braces" option in the configuration window. After I'd made these changes it worked a lot more consistently.

Some of the time it still pulled out the wrong details if it mis-extracted the article title. However, I'd named all my pdfs with the title of the paper, so it was simply a case of copying and pasting the filename into the title field and rerunning. It would have been really nice to be able to use the filename of my pdf as part of the search, but unfortunately I couldn't find a way of doing that.
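
Outside cb2bib, the idea of searching on the filename can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, the query template is the one from the netqinf.txt entry above with <<title>> substituted by hand, and the URL-encoding step is this sketch's choice, not necessarily what cb2bib does internally.

```python
from pathlib import Path
from urllib.parse import quote_plus

# Query template copied from the Google Scholar entry above.
QUERY = "http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<title>>&btnG=Search"


def title_from_filename(pdf_path):
    """Assume the pdf is named after the paper title: drop the folder and extension."""
    return Path(pdf_path).stem.replace("_", " ").strip()


def scholar_query(pdf_path):
    """Fill the <<title>> placeholder with the URL-encoded filename title."""
    return QUERY.replace("<<title>>", quote_plus(title_from_filename(pdf_path)))


print(scholar_query("papers/A simple example.pdf"))
```

The printed URL could then be opened in a browser by hand, or the derived title pasted into the cb2bib title field.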

The only other issue I'm having is that although cb2bib adds in the link to the pdf file, JabRef won't understand it, as it uses a very slightly different bibtex format for it. The cb2bib format seems to be:
file = {location}
whereas the JabRef format seems to be:
file = {description:location:type}
There is a comment here by a Mendeley admin that suggests that there is no prescribed format for this aspect of a bibtex file, so I guess it's to be expected. I should be able to work around it with a bit of clever find/replace, but it's an annoyance.
ACTUALLY - this seems to be working under windows! It looks like a different version of JabRef has gotten around this issue.
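
For a JabRef version that does not accept the cb2bib form, the find/replace could be sketched in Python along these lines. The function name, the empty description and the "PDF" type label are assumptions made for illustration. The sketch deliberately skips any file value that already contains a colon, which covers entries that are already converted but also means Windows paths with a drive letter are left alone.

```python
import re

# Matches a line like:  file = {/path/to/paper.pdf}
# Values containing ':' or nested braces are intentionally not matched.
FILE_FIELD = re.compile(r"^([ \t]*file[ \t]*=[ \t]*)\{([^{}:]+)\}", re.MULTILINE)


def to_jabref_file_field(bibtex, file_type="PDF"):
    """Rewrite file = {location} as file = {:location:type} (empty description)."""
    return FILE_FIELD.sub(
        lambda m: f"{m.group(1)}{{:{m.group(2)}:{file_type}}}", bibtex
    )


entry = """@article{example2011,
  title = {An example},
  file = {/home/me/papers/An example.pdf},
}"""

print(to_jabref_file_field(entry))
```

Running it a second time on its own output changes nothing, so it should be safe to apply to a file that is partly converted.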

UPDATE: After a couple of months of getting used to cb2bib and using it to produce a document I'm not really finding the need to use JabRef at all! The 'citer' facility of cb2bib is actually really good.

UPDATE: I hadn't previously gotten round to extracting from IEEE Xplore, as almost everything is on Google Scholar. However I've just tried to set it up and found that the IEEE pages use javascript buttons to produce the citation. This makes it difficult to fully automate.


If you add the following to netqinf.txt then it should search IEEE Xplore for the title. You can then manually click the "download citation" button, select BibTeX format and copy the BibTeX citation into cb2bib:

# QUERY INFO FOR IEEEXplore
journal=
query=http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=<<title>>&x=35&y=7
capture_from_query=arnumber=(\d+)&contentType
referenceurl_prefix=http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=
referenceurl_sufix=
pdfurl_prefix=
pdfurl_sufix=
action=browse_referenceurl
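
To show how these fields appear to fit together, here is a small Python sketch. It reflects a reading of the entry above, not cb2bib's actual code: the captured group is assumed to be placed between referenceurl_prefix and referenceurl_sufix, and the page fragment and article number are invented.

```python
import re

# Values copied from the netqinf.txt entry above.
CAPTURE = re.compile(r"arnumber=(\d+)&contentType")
PREFIX = "http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber="
SUFFIX = ""


def reference_url(page_text):
    """Build the reference URL from the first article number found, if any."""
    match = CAPTURE.search(page_text)
    if match is None:
        return None
    return PREFIX + match.group(1) + SUFFIX


# Hypothetical fragment of a search results page.
sample = '<a href="/xpl/articleDetails.jsp?tp=&arnumber=1234567&contentType=Journals">'
print(reference_url(sample))
```

With action=browse_referenceurl, the resulting address is what would be opened in the browser, from where the citation can be downloaded by hand.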

14/07/2011

Latex \psfragfig of figures in other folders

What I've spent part of today wrestling with...

I've detailed previously how I'm exporting Matlab plots and including them in Latex documents. This works very well for figure files in the same directory - which is kind of the way Latex is set up. Unfortunately I'd like to maintain my folder structure in a different manner. For example, I have a "results" folder and a "thesis" folder, both at the same level. Within "results" I have my Matlab plotting scripts and the images that are automatically saved by them. In my "thesis" folder I have my Latex files, in which I would like to include my results images.

Ideally I would be able to simply reference the image in my Latex document but that's not quite how it worked out. Let me explain what I tried...

Using \graphicspath
This is the most obvious way of doing things as noted here, and it works fine for graphics inserted with the \includegraphics command. I was also able to reference relatively like this: \graphicspath{{../results/}} and apparently it would also look in all the subdirectories if I put a double slash on the end: \graphicspath{{../results//}}.

This would be a great solution if I were using the \includegraphics command - but I'm not. I'm using \psfragfig, and \graphicspath has no effect on it, I guess because \psfragfig isn't set up to use it?

There is also a suggestion here that this method shouldn't be used, in favour of adding the images directory within your Latex compiler setup - although I'm not sure this would be appropriate for what I was trying to do.

Using \input, \include or \import
These all do similar things, in that they allow you to bring in latex documents from elsewhere in a folder structure. Only some of them will allow relative linking via higher level directories (rather than just subdirectories). \import seemed to be what I needed (or rather the \subimport variant of it did), so I set up a .tex file containing only my \psfragfig command within the "results" folder and imported that into my document. This got the figure into the document, but it didn't do all the nice text replacement that it was supposed to. In this respect it was no better than the previous technique.

After some experimentation it seems that \psfragfig only looks locally to the master document for the .tex files that it uses to include text with the figures.

Using \write18
So it appeared that the only way I was going to get the \psfragfig command to work properly was by having both the "figure.eps" file and the "figure.tex" file local to the master document. I'm not keen on the unnecessary duplication involved in this, so at the very least I decided to make it automatic. I therefore set it up so that these two files are automatically copied to the local directory during compile. This means that I always have the most up to date version of them in my document, and everything stays nice and automatic.

I think the only way to copy a file around between directories from within a Latex document is by issuing a system level command using \write18. This is usually disabled by default in Latex as it has the possibility of really mucking things up in your system if you were compiling 3rd party code. Therefore I had to compile my code with the extra argument '--shell-escape' as detailed here.

So I want to use the Latex \write18 command to execute a DOS copy command, that's fine; but there is also the tricky issue of needing to use backslashes for the path in the DOS copy command - these are obviously reserved in Latex. Therefore I had to use the technique outlined here to get around this.

So finally I managed to put together a Latex command that works, and it looks like this:
\def\psfragfigremote{\begingroup\catcode 92 12 \execB}
\def\execB#1#2%
{\immediate
\write18{copy #1#2.eps}%
\write18{copy #1#2.tex}%
\psfragfig*{#2}
\endgroup
}

So after including that in my header I can reference figures in other directories in my document like so:
\psfragfigremote{..\results\}{figure}
This copies both figure.eps and figure.tex (generated from Matlab with the matlabfrag command) from the results folder into the current folder and then includes them in the document.

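It works fairly well, but not the first time it is run, so two compiles may be necessary to get all the labels sorted out. I wasn't sure why at first, but a likely cause is that \immediate only applies to the \write that directly follows it. In the command above only the first \write18 (the .eps copy) is immediate, so the .tex copy is probably deferred until the page is shipped out, after \psfragfig has already run. Putting \immediate in front of the second \write18 as well should cure this.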

I'm sure there must be a better way of doing this, possibly modifying the \graphicspath and \psfragfig commands so that they play well together would be a neater solution, but this works ok for now.

I hope this is of use to someone. If you have any improvements or suggestions then please comment.

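UPDATE: Replacing the \write18 commands with:
\write18{robocopy #1 . #2.tex #2.eps}%
and removing the star after \psfragfig speeds up processing significantly. robocopy only copies the files if they have changed, and without the star \psfragfig should only regenerate the figure when its source files are newer than the existing output.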
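
The "copy only if changed" idea can also be sketched outside Latex, for example as a small Python helper run before compiling. This is an alternative illustration and not what the command above does: the function name is hypothetical, and it compares modification times only, which is a simpler check than robocopy's.

```python
import shutil
from pathlib import Path


def sync_figure(src_dir, dst_dir, name, exts=(".eps", ".tex")):
    """Copy name.eps and name.tex into dst_dir if missing or older than the source.

    Returns the list of file names that were actually copied.
    """
    copied = []
    for ext in exts:
        src = Path(src_dir) / (name + ext)
        dst = Path(dst_dir) / (name + ext)
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            shutil.copy(src, dst)
            copied.append(dst.name)
    return copied


# Example call, assuming the folder layout described above:
# sync_figure("../results", ".", "figure")
```

Run from the "thesis" folder, the example call would refresh figure.eps and figure.tex only when the copies in "results" are newer.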

11/07/2011

Google Chrome Extensions

A quick post to note down some of the extensions I'm using for Google Chrome. I switched over to Chrome as my browser of choice around a year ago (from Firefox) and I'm still very much enjoying the experience. The extensions I use are probably what make that experience good, so I thought I would give some kudos to them.

AdBlock - Blocks adverts that I don't want to see, brilliant
Add to Amazon Wish List - Allows me to keep track of stuff I'd like to buy
Backspace As Back/Forward for Linux - Does what it says most of the time, though not always - better than nothing.
Better Gmail (unofficial) - Cleans up my gmail inbox a bit
Context Menu Search - This is awesome, it allows me to highlight text and search for it in the websites I usually use. I use it most for Google Maps and internet shopping (Amazon, eBay and Google Products).
Facebook - Lets me keep on top of that all important social networking.
Google Calendar Checker - Counts down the time to my next event and lets me go straight to my calendar with one click.
Google Calendar ebay reminder - I had to hack this one a bit to make it work, but it adds a link to eBay auction pages allowing me to add the auction end as an event in my calendar.
Mail Checker Plus - An icon showing how many unread messages I have and a quick link to my inbox.
Use HTTPS - Keeps my browsing secure
KB SSL Enforcer - More security stuff
Shareaholic - Lets me email a link to the current page, or makes a shortened URL for it.

As you can see from this list, most of them just cut down the number of buttons I need to press to get stuff done - what can I say, I'm lazy!

Anything else that I might find useful?