A Fireside Chat with Alex Smith of iManage RAVN - Part 1
Alex Smith is the Global Product Management Lead for iManage RAVN, a cutting-edge Artificial Intelligence platform that powers a number of applications to automatically organize, discover and summarize your documents.
A month or so ago, Alex and I sat down to chat about a number of legaltech topics. The seeds of this interview were planted then, and grew over the following weeks. Alex had so many insightful views into this subject that I have split the post in two. Part 2 will be published on Thursday, August 15, to coincide with the official launch of the Tower of Babel blog.
Is there a language problem with AI or am I imagining things?
I’d start at the highest overview here … legal is a very difficult domain, like really difficult. To paraphrase a famous quote: “Legal is difficult. You just won't believe how vastly, hugely, mind-bogglingly difficult it is. I mean, you may think it's a long way to AI a contextual ad in Facebook, but that's just peanuts compared to legal language.” (The quote is from my favourite writer and literally the only futurist ever … Douglas Adams.)
Why is legal difficult … because lawyers don’t actually use their own languages, they use legal versions of their own language. For example, pretend to be a lay person and try reading a contract … what does it actually mean? Would you define that as ‘common language’?
And sorry, I’m not going to just stick here to the flavour-of-the-decade - contracts - but also, look at what it’s like to read judgments, legislation and regulations. They all have their evolved or official ‘codes’ of language, which means that applying generalised ‘out of the box’ ML isn’t going to work.
Interestingly, to date attempts at machine learning in legal seem to be focused on the more ‘narrative’ aspects of legal content like case reports/headnotes and contracts - but especially the latter. I see less work looking to machine learn content in legislation and regulation – and am aware of some serious failures at a large scale in this area.
Whilst we’re discussing language in AI, one of the first starting points is structure in legal content. Understanding structure and what you’re looking at is key to breaking it down into a state that you can target the right machine learning at. And here’s the rub: each country, jurisdiction or even state often has different formats for similar types of content. So for example a contract format in one country has little in common with that of another jurisdiction – look at the differences between UK contracts, with a highly structured bulleted approach, and US contract ‘essays’, with paragraph-based text. Same with the difference between US/Canadian case reports, with highly structured headnotes and judgment formats, and the Commonwealth approach of wordy judgments and ‘essay’ headnotes.
Finally, not every country has the same approach to legal transactions … for example a lot of Japanese contracts are very light-touch due to cultural norms of business honour and so bear little structural similarity to ‘international’ contracts. And China, for example, has no concept of case law precedent or authority at all.
Finally, finally … some of this could be globally dealt with if we had huge amounts of data and huge amounts of computing power but legal is a small data domain and so many of the data sets are in private isolated domains, for very obvious reasons. This really is important. For narrow problems a limited data set may be fine, especially when you’re trying to extract data, but if you’re looking at solving the big problems in legal data across jurisdictions, there isn’t really the amount of data you need. Ally that to the lack of publicly available legal data that should be accessible (legislation, regulation and cases) and we have a data problem.
How big a problem is it? How hard is it to overcome?
It’s huge … where there are similarities in content and similar history, certain parts of ‘AI’ can be transferred - but it relies on a commonality of content types, of linguistic legal approaches and of available data. ‘Transferring’ learning is possible but often requires ‘retraining’ if the new area is in near proximity, or just starting ‘from scratch’ where the domain is different. So an example may be where there is similarity in format. Approaches to mining structured data in US cases may have transference to Canadian cases due to similar approaches to the format of legal content (eg transcripts and headnotes), but may struggle to be useful in a Commonwealth jurisdiction where headnotes and judgments are more ‘essays’ (sorry to say) than the structured approach in North America. So finding fact patterns, parts of judge speech and all the things important in US legal research may not be plug and play in another jurisdiction – even an English one, for the reasons I have gone into.
When I spoke to our lead data scientist at RAVN, his response was: “It is a big problem. I think the question is more what of it do we want to overcome. For example, we don’t need to create AI solutions that will work for absolutely every language, that would be crazy and nearing impossible. Now if we could make one that dealt with English, Spanish, French, German in one then that’d be cool, and this is sort of possible with the state of the art today but would require significant resource to operate.”
So what does one do about this?
What we see is that it’s often easy to learn from an existing approach, and then retrain. Most of what we see today is supervised machine learning anyway, so providing the tools and methodology to retrain for a new language is often what a platform should be able to deliver. Given the lack of access to public data in many of the domains, this is the way users of the technology have to head if they want to even attempt to learn ‘across’ languages. Where we may get to is ‘territorial pods’ (have to thank a former colleague for that term), where there is a possibility of sharing actual algorithms rather than ‘just’ approaches and tooling to train and learn. But you still need the data to train on, and then you have to test to the levels of accuracy that meet your use case or the problem you’re trying to solve – which is usually pretty high when dealing with lawyers.