Did you see the graph benchmark? I found it quite interesting. It had to do a graph traversal on a natural text representation of a graph. Pretty much your problem.

▲

stopachka 3 hours ago | parent | next [-]

Update: I took a corpus of personal chat data (this way it wouldn't be seen in training), and tried asking it some paraphrased questions. It performed quite poorly.

	▲	abraxas 2 hours ago \| parent [-]
		Which models did you try?

▲

stopachka 5 hours ago | parent | prev [-]

Oh, interesting!