
Evaluation of Large Language Models for Enterprise Use

Summary: We will cover a new way to approach AI benchmarks and use it to evaluate one of the recent models, Microsoft Phi-3. The results and code are presented at the end.

Imagine you’re hiring for a role in your company, and your next candidate is... an AI! How would you assess its skills? You'd want to make sure it’s the right fit, just like any human applicant, right?

Traditionally, AIs are tested with standard questions that they can, quite frankly, rehearse for. But what if we took it up a notch and interviewed AI like we do people, with nuanced, real-world challenges?

We’ve developed a unique set of criteria to “interview” AI, going beyond the basics to really understand its capabilities. Think of it as two rounds of interviews to gauge its intelligence across various domains:

First Round – Testing General Smarts:
  • Can it discuss historical events like an expert?
  • Is it a math whiz, able to solve complex equations?
  • How about classifying financial instruments – can it sort them correctly and explain why?
  • Does it have language skills to translate phrases accurately?
  • Can it explain scientific theories in simple terms?
Second Round – Advanced Evaluation:
  • Can it dive deep into economic history?
  • How would it handle interpreting financial data in risk assessments?
  • Is it adept at translating technical financial terms into another language?
  • Can it discuss the implications of advanced technology like quantum computing?

The idea is to create an in-depth interview process that truly measures an AI’s expertise, especially for roles like coding agents. Before deploying AI for a specific task, we’d have a solid understanding of its strengths in the required domain.
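To make this concrete, here is a minimal sketch of how such a two-round interview rubric could be organized in code. The rounds and domains mirror the lists above; the dictionary layout and helper function are illustrative assumptions, not the exact format we use internally.

```python
# Illustrative layout of the two-round "AI interview" question bank.
# The rounds and domains follow the lists above; the exact data structure
# is an assumption made for this sketch.
INTERVIEW_ROUNDS = {
    "round_1_general": {
        "history": "Discuss the factors that led to the fall of the Roman Empire.",
        "math": "Solve the equation x^2 - 4x + 4 = 0 and explain each step.",
        "finance": "Classify these instruments into 'debt', 'equity', and 'derivatives'.",
        "language": "Translate 'Hello, how are you?' into French.",
        "science": "Explain Einstein's theory of relativity to a 5th grader.",
    },
    "round_2_advanced": {
        "economic_history": "Explain the financial factors behind the Roman Empire's rise and collapse.",
        "quant_finance": "Interpret eigenvalues in the context of financial risk assessment.",
        "financial_translation": "Translate 'Accrued Liabilities' into Mandarin.",
        "frontier_tech": "Explain quantum entanglement in plain language.",
    },
}

def questions_for(round_name: str) -> list[str]:
    """Return the list of prompts for a given interview round."""
    return list(INTERVIEW_ROUNDS[round_name].values())
```

Keeping the questions as structured data like this makes it easy to add new domains or swap in a second round without touching the evaluation loop itself.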

Let's take a sneak peek at the kind of questions we ask:

🌍 General Knowledge Test:
  • "Discuss the factors that led to the fall of the Roman Empire. Yes, as if you were a history buff!"
  • "Are you a numbers guru? How about solving this equation: x^2 - 4x + 4 = 0?"
  • "Here's a bunch of financial terms. Can you sort them into neat little buckets of 'debt', 'equity', and 'derivatives' for us?"
🗣️ Language Skills:
  • "Show off your multilingual skills. How would you say 'Hello, how are you?' in French?"
  • "Science time! Can you break down Einstein’s theory of relativity in a way that a 5th grader would get it?"
⚖️ Ethics & Legal Reasoning:
  • "Let’s talk law. What’s the big deal about copyright infringement?"
  • "Dive into Shakespeare’s 'Hamlet' and tell us what’s really going on beneath all that drama."
🌳 Environment & Empathy:
  • "Climate change is serious business. What’s heating up our planet?"
  • "Imagine a friend is going through a tough time. How would you offer comfort?"
🔧 Technical Know-How:
  • "Your friend's router is down. They’re missing their favorite series. Quick, how do you fix it?"
Round Two – Deep Dive: For AIs that make it past the first round, things get even more interesting:
  • "Economies rise and fall. What's the financial secret sauce behind the Roman Empire's success and collapse?"
  • "Let's get techie. Interpret these 'Eigenvalues' in finance-speak."
  • "How do you say 'Accrued Liabilities' in Mandarin? Show us what you've got!"
  • "Quantum entanglement might be mind-boggling, but can you make it sound like a walk in the park?"
Final Challenges:
  • "Predictive policing algorithms – where's the line between safety and privacy?"
  • "Shakespeare loved his beats. How does iambic pentameter keep the audience hooked?"
  • "Carbon credits – are they the superhero the world needs?"
  • "Negotiations can be tricky when the stakes are uneven. How would you level the playing field?"
  • "Tech time! Draft us a plan to move an entire company to the latest internet protocol."

These questions aren't just trivia; they're a litmus test for the AI's depth of knowledge, problem-solving prowess, and adaptability. Stay tuned as we put the leading AIs through their paces!

With this approach we tested Microsoft Phi-3, a hotshot small language model, using LangChain evaluation on a Colab notebook. You can access the code and results here. Phi-3 scored 10/10 on Round 1 and 8/10 on Round 2; the Round 2 tasks are fairly complex and require deeper domain knowledge. I will be expanding the evaluation criteria over time.
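For reference, the sketch below shows one way such a LangChain-based evaluation could be wired up on Colab. The model ID, grader model, pipeline settings, and criterion text are assumptions made for illustration; they are not necessarily the exact configuration used in the notebook.

```python
# Minimal sketch of the evaluation loop, assuming Phi-3 mini is loaded from
# Hugging Face and a stronger model (here ChatOpenAI, an assumption) acts as
# the grader. The question and scoring criterion are illustrative.
from langchain_community.llms import HuggingFacePipeline
from langchain_openai import ChatOpenAI
from langchain.evaluation import load_evaluator

# Candidate model being "interviewed".
candidate = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 512},
)

# Grader model and a criteria-based evaluator from LangChain.
grader = ChatOpenAI(model="gpt-4o", temperature=0)
evaluator = load_evaluator(
    "criteria",
    criteria={
        "domain_expertise": "Does the answer show accurate, expert-level "
                            "knowledge of the domain in the question?"
    },
    llm=grader,
)

question = "Discuss the factors that led to the fall of the Roman Empire."
answer = candidate.invoke(question)

result = evaluator.evaluate_strings(prediction=answer, input=question)
print(result["score"], result["reasoning"])  # pass/fail verdict plus the grader's rationale
```

Running a loop of this shape over both question banks and tallying the per-question verdicts is what produces round-level scores like the 10/10 and 8/10 reported above.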

#AILiteracy #AIMetrics #AIInterview #InnovativeHiring #FutureOfWork


Author
Dr. Balaji Viswanathan

CEO of Mitra Robot. Ex-Microsoft. I have a PhD in Computer Science with a focus on deep learning and robotics. Featured in CNN, BBC, Forbes and the History Channel. Top Writer on Quora.