By Prajith Raju

Behind the Scenes: A Look at How Betty Works

Updated: Jul 15

At a party the other day there was some furious debate about the black-box nature of ChatGPT and whether it should reference its sources the way Wikipedia or other sites we're used to do. What was missing from the debate was an understanding of how ChatGPT works behind the scenes. ChatGPT, or rather the underlying LLM (large language model), has been trained over years on a TON of data, resulting in the model ChatGPT uses today. All ChatGPT is doing is giving you an answer based on its computations: it's predicting the next best word to put in front of you using a very complicated algorithm. In most cases it is certainly not copying a piece of text verbatim from somewhere and showing it to you (unless that's what the prompt asked for).
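If you want to see that next-word idea in miniature, here's a minimal sketch using the open-source GPT-2 model via the Hugging Face transformers library. This is just an illustration of next-word prediction, not the model behind ChatGPT, and the prompt is made up.

```python
# Illustration only: a small open model extending a prompt one likely word at a time.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The quickest way to reach more homeowners is"
# The model samples from the next-word probabilities it learned during training.
result = generator(prompt, max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```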


Here’s another example: let’s say we’re building a model to detect whether a given image is of a dog or a cat. The model gets trained on several thousand pictures, each previously labelled as cat or dog and fed to the algorithm.


The algorithm detects key features — nose, whiskers, forehead, jaw — and puts together a mathematical model so that if you give it a new image in the future, it can tell you whether it’s a cat or a dog. Of course, model explainability is a separate field on its own, and the amazing engineers at Google have taken a stab at the explainability problem for dog/cat classification, read more here.
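To make the cat/dog analogy concrete, here is a toy sketch of training such a classifier. The load_labeled_images() helper is hypothetical (any labeled image dataset would do), and a production system would use far better features than raw pixels.

```python
# Toy sketch of the cat/dog idea, not a production classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical helper: returns image arrays of shape (n, 64, 64, 3) plus "cat"/"dog" labels.
images, labels = load_labeled_images()

X = images.reshape(len(images), -1)   # flatten pixels into one feature vector per image
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)           # "learn" what separates cats from dogs
print("held-out accuracy:", model.score(X_test, y_test))
```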



This is very similar to the problem we’re after: we’ve built a model to predict inquiries based on millions of rows of data. While the ML part of the data pipeline might feel like a black box, my intent in this blog is to take you behind the scenes on some of the other parts of the data processing. This blog captures some of the tens of thousands of hours of work our data science engineers have put in to create Betty.

So you upload a file on Reworked.ai. What happens next? How does Betty know which rows to give a higher score, suggesting a potential deal or inquiry, and which rows to score lower?


Various stages of file processing


First, we ensure the file is ready to be processed (the File Validation step): we check which columns it has, the format of the data in those columns, etc. Then it goes through three main stages.
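As a rough idea of what a validation step like this might look like, here is a hedged sketch in pandas. The required column names and the zip-code format rule are assumptions for illustration, not Reworked.ai's actual schema.

```python
# Sketch of the kinds of checks a File Validation step might run (assumed schema).
import pandas as pd

REQUIRED_COLUMNS = {"owner_name", "mailing_address", "zip_code"}  # illustrative column names

def validate_file(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Check that the expected columns are present.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"File is missing required columns: {sorted(missing)}")

    # Basic format check: zip codes should look like 5-digit strings.
    bad_zips = ~df["zip_code"].astype(str).str.fullmatch(r"\d{5}")
    if bad_zips.any():
        raise ValueError(f"{bad_zips.sum()} rows have malformed zip codes")

    return df
```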

Stage 1: Pre-process & Data Augmentation

This process takes the longest. We not only enrich missing data, but also proactively go out and hit external vendors to augment more data, so there’s enough color on each row for Betty to make an informed decision. This is akin to the features on a cat’s or dog’s face: we need as much of a “full picture” of a property and its owner as possible so we can predict accurately. Some examples of data we augment are demographics at a zip code level and property details such as when the property was last sold.
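Here is a minimal sketch of what that enrichment could look like. The fetch_zip_demographics and fetch_property_details calls stand in for external vendor lookups and are hypothetical, as are the column names.

```python
# Sketch of data augmentation: pull extra context for each row from outside sources.
import pandas as pd

def augment(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical vendor call returning e.g. {"median_income": ..., "population": ...}
    demo = df["zip_code"].map(fetch_zip_demographics)
    df["median_income"] = demo.map(lambda d: d.get("median_income"))

    # Hypothetical vendor call returning e.g. {"last_sold_year": ...}
    details = df["property_id"].map(fetch_property_details)
    df["last_sold_year"] = details.map(lambda d: d.get("last_sold_year"))
    return df
```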

Stage 2: Rules Engine

Now that Betty has figured out what each column means and enriched the data, it’s time to get busy with predictions. In this step we simply use the power of reasoning to eliminate the obvious ones. For example, if someone is running a mail campaign and a mailing address is missing or not verifiable, Betty simply reduces the score for that row. This step is constantly evolving and is personalized based on what our customers are shooting for with their campaign.
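To show the flavor of a rule like that, here is a hedged sketch of one rule. The is_verifiable() check and the penalty factor are illustrative assumptions, not Betty's actual logic or weighting.

```python
# Sketch of one mail-campaign rule, assuming a hypothetical is_verifiable() address check.
def apply_mail_campaign_rules(row: dict, score: float) -> float:
    address = row.get("mailing_address")
    if not address or not is_verifiable(address):
        # Can't mail this owner, so the row is worth less for a mail campaign.
        score *= 0.5  # illustrative penalty only
    return score
```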

Stage 3: ML Model

This is the most critical step: here we use the ML model we’ve built (and continue to improve upon) to score each row. The model has been built over years by smart engineers on several million rows of data and counting. Which features we use and which specific algorithm we use to build the model are proprietary, but we’re constantly adding new features and testing new algorithms to improve our models.

The ML problem at hand is a supervised, binary classification problem. Because we’re dealing with a highly imbalanced dataset (i.e., in any given file the number of successful inquiries is orders of magnitude smaller than the number of non-inquiries), we have a specific suite of algorithms to choose from (see the sketch after this list). Some of the algorithms that work well in this scenario are:

  • Naive Bayes

  • Random Forest

  • KNN

  • Histogram gradient boosting classifier
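Here is a minimal sketch of that kind of setup, using one algorithm family from the list above (histogram gradient boosting in scikit-learn) and re-weighting the rare positive class. The feature and label columns are placeholders; Betty's real features and training pipeline are proprietary.

```python
# Sketch: supervised binary classification on an imbalanced dataset (placeholder columns).
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.utils.class_weight import compute_sample_weight

X = df[["median_income", "last_sold_year"]]  # placeholder features from the earlier stages
y = df["inquired"]                           # 1 = inquiry, 0 = no inquiry (rare positives)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

# Up-weight the rare positive class so the model doesn't just predict "no inquiry".
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train, sample_weight=weights)

# With heavy imbalance, precision-recall metrics are more telling than accuracy.
scores = clf.predict_proba(X_test)[:, 1]
print("average precision:", average_precision_score(y_test, scores))
```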

Once all three stages run successfully, a file with predictions and scores is ready for download and use!


