A Fireside Chat with Alex Smith of iManage RAVN: Part 2
In Part 1 of this interview, Alex spoke to me about what a huge problem it is to apply machine learning to legal documents written in different languages, or written in the same language but structured completely differently from one jurisdiction to the next.
How is iManage RAVN dealing with these language and jurisdiction problems?
iManage RAVN have taken the approach of putting the ability to train models for specific use cases into our clients’ hands, as in most cases they are the ones who have the data, especially in the contracts arena. The goal is to make available a toolkit that clients can use on the problems they have, with the relevant data. So we offer certain ways of cutting up data for training, and methods for training and testing on that content. It’s a case of us providing tooling that customers can use to train, in whatever language they work in, on the problems they have to hand.
The issue, I guess, is that firms and in-house teams feel this training should be easy and don’t understand that machine learning relies on inputs and data. This is often due to the hype in the market that miracles are possible, and a shortage of stories about the time, effort and volume of data that go into many of the tools. We’re increasingly running educational programmes through our AI University initiative with clients to share the data curation experience the iManage RAVN team have, and to make sure AI and data extraction projects are approached with eyes wide open – especially around the right expertise, methodology, process and testing (as well as the right data).
That’s a very good point – the hype in the market. One of my pet peeves is the understatement of how much data is actually necessary to effectively train a machine learning tool on complex concepts. Why do you think there is such a disconnect between the amount of training that is actually required on the one hand, and the perception on the law firm side of what they think is required?
Why the divide? Law firms generally don’t understand what it takes to deliver these projects (with the exception of some great firms that do) and the curation, training, SUPERVISED learning, and testing involved. So over the last few years they have wanted to buy the answer. The press and vendors have over-hyped the outcomes (a subject for an entire other post) and lawyers have been disappointed with the results. We’re correcting now as an industry; the winners will be those that understand the people, process, data and technology aspects of what it takes to succeed and make this work. That requires a holistic approach to the business case, not just hoping tech wins. For vendors, success will be found in creative collaborative partnerships across the evolving legal services ecosystem.
And how does it work when you do the training yourselves, on behalf of a client? Does that look different from when the clients do it themselves?
Again speaking to our lead data scientist: “Where we do the training for clients we don’t reuse generalised models, we build models specific to each language. We do reuse approaches i.e. the algorithm/model architectures to an extent, but some twiddling is usually required to tune for the foibles of each language. It’s certainly not re-invention each time but it’s not quite cookie-cutter either.”
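The “reuse the approach, retrain per language” idea in that quote can be sketched in a few lines. Everything below is a hypothetical illustration – the clause texts, labels, and featurizers are made up for the example, not iManage RAVN’s actual models – but it shows the shape: one shared classifier implementation, with per-language tuning of the features (the “twiddling”).

```python
# Minimal sketch: one shared classifier "architecture", with a separate
# trained model per language. English uses word features; Dutch uses
# character n-grams (a plausible tuning choice for compounding languages).
# All training data here is tiny and illustrative - real projects need
# far more (see the earlier point about data volume).
from collections import Counter, defaultdict

def word_features(text):
    return text.lower().split()

def char_ngram_features(text, n=4):
    t = text.lower()
    return [t[i:i + n] for i in range(len(t) - n + 1)]

class ClauseClassifier:
    """Shared architecture: a bag-of-features scorer, reused per language."""
    def __init__(self, featurizer):
        self.featurizer = featurizer
        self.label_features = defaultdict(Counter)

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_features[label].update(self.featurizer(text))
        return self

    def predict(self, text):
        # Score each label by feature overlap with the input clause.
        feats = Counter(self.featurizer(text))
        return max(
            self.label_features,
            key=lambda lbl: sum(min(c, feats[f])
                                for f, c in self.label_features[lbl].items()),
        )

# English model: word features.
en = ClauseClassifier(word_features).fit(
    ["this agreement is governed by english law",
     "the tenant shall pay rent monthly"],
    ["governing_law", "payment"])

# Dutch model: same architecture, different per-language tuning.
nl = ClauseClassifier(char_ngram_features).fit(
    ["deze overeenkomst wordt beheerst door nederlands recht",
     "de huurder betaalt de huur maandelijks"],
    ["governing_law", "payment"])

print(en.predict("governed by the laws of england"))   # "governing_law"
print(nl.predict("de huur wordt maandelijks betaald")) # "payment"
```

The point of the sketch is that the code (the “algorithm/model architecture”) is written once, while each language contributes its own training set and its own feature tuning – not reinvention each time, but not cookie-cutter either.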
Certain other parts of what many believe to be AI, such as entity extraction, may be more amenable to common models, but even things like citation recognition or finding legislative references require a lot of localisation and specialist knowledge.
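To see why citation recognition resists a single common model, consider a toy per-jurisdiction pattern registry. These regexes are deliberately simplified illustrations of real citation formats (UK neutral citations, US reporter citations, German statute references), not production rules – a real system would need far more locale-specific handling:

```python
# Toy illustration: citation formats differ by jurisdiction, so a
# "common" citation recogniser ends up being a set of localised rules.
import re

CITATION_PATTERNS = {
    # UK neutral citations, e.g. [2019] UKSC 12
    "uk": re.compile(r"\[\d{4}\]\s+(?:UKSC|UKHL|EWCA|EWHC)\s+\d+"),
    # US reporter citations, e.g. 347 U.S. 483 (1954)
    "us": re.compile(r"\d+\s+U\.S\.\s+\d+(?:\s+\(\d{4}\))?"),
    # German statute references, e.g. § 242 BGB
    "de": re.compile(r"§\s*\d+[a-z]?\s+(?:BGB|StGB|HGB|ZPO)"),
}

def find_citations(text, jurisdiction):
    """Return all citation strings matched by that jurisdiction's rules."""
    return CITATION_PATTERNS[jurisdiction].findall(text)

print(find_citations("See [2019] UKSC 12 at [45].", "uk"))
print(find_citations("Brown v. Board, 347 U.S. 483 (1954).", "us"))
print(find_citations("Nach § 242 BGB gilt Treu und Glauben.", "de"))
```

Even this toy version shows the localisation burden: every new jurisdiction means new formats, new abbreviations, and specialist knowledge of how its lawyers actually cite.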
In how many different markets do you have customers? What languages do they speak?
Our client base spans countries like the U.S., Canada, South Africa, Australia and India, as well as clients in Europe like Spain and Holland. There are of course also differences across the English-speaking domains (“nations divided by a common tongue”). We provide search tools that are easier to implement across multiple languages, as this is a long-running area solved by language packs in the search tool. But on the AI training side, for classification or extraction, all the issues discussed above arise.
We continue to expand models and to test how they can be applied to content (not just contracts!) and in which languages. The Roman/Latin-based scripts are well developed, but as we expand to clients outside of those we may be back to new models and approaches. The team has run projects over the last five years in various languages for various use cases, with a lot of lessons learnt.
In many European countries, of course, much of the big corporate market works in English. In a country like the Netherlands, loan documentation and SPAs will usually be in English, or at least have an English translation, lessening the need for the international work.
What is your biggest barrier to entry for a new jurisdiction?
Understanding the legal domain and understanding the content (and its approaches/formats) probably both come before understanding the language. We look for commonality within ‘territorial pods’ and close-proximity areas, and then use those approaches and ways of training as a base. For areas where we already have linguistic experience, the issue is finding the training sets to expand (see the earlier comment about lack of data), and we can put the training tools into the hands of our customers, as they often have the actual data. For highly different languages where we have less experience, it comes down to accessing the data and then understanding what techniques and approaches to work with and train on that data.
Do you think that non-English speaking jurisdictions are currently underserved by legal technology vendors / products?
Is it wrong to say that legal AI follows the money and the international use of the English language? I don’t think they are underserved, as it’s hard to transport tools even across English-speaking jurisdictions, so we’re not always benefiting from ‘one language’. It comes down to the size of the legal market and the desire to invest either in the technology to build products or, increasingly, in taking a platform (like RAVN) and training the models to serve clients – or, if open data exists, building products on top of these platforms rather than rebuilding from scratch. These platforms may be legal tech or may be wider than that – see announcements around Google and Microsoft being interested in this area.
What do you think the future looks like for the industry? Do you think consolidation needs to occur?
Yes – we’ve been at legal ‘contract AI’ since c. 2010 (almost a decade now) and the wider ‘legal research content’ space since the 1970s. We’re seeing wheel reinvention everywhere … if I were a global law firm looking to provide data-driven legal services to clients who don’t care about the tech but do care about the outcomes, I’d be confused right now, trying to work out how I ‘evenly’ provide a consistent legal service to clients wanting an international experience. We’ll see firms move from ‘playing’ with multiple tools to choosing a few key platforms and working with those platforms to crack these difficult questions. Some of these platforms already exist and are being used by firms and their bogeyman, the Big 4.
To find out more about iManage RAVN, visit imanage.com.