Data scientists spend weeks or months not only preprocessing the data on which models are to be trained, but also extracting useful features (the measurable properties of that data), narrowing down algorithms, and ultimately building (or attempting to build) a system that performs well not just within the confines of a lab, but in the real world.
Salesforce’s new toolkit aims to ease that burden somewhat. On GitHub today, the San Francisco-based cloud computing company published TransmogrifAI, an automated machine learning library for structured data — the kind of searchable, neatly categorized data found in spreadsheets and databases — that performs feature engineering, feature selection, and model training in just three lines of code.
It’s written in Scala and built on top of Apache Spark (some of the same technologies that power Salesforce’s AI platform, Einstein), and it was designed from the ground up for scalability. To that end, it can process datasets ranging from dozens to millions of rows, running either on a cluster of machines on top of Spark or on an off-the-shelf laptop.
Mayukh Bhaowal, director of product management for Salesforce Einstein, told VentureBeat in a phone interview that TransmogrifAI essentially transforms raw datasets into custom models. It’s the evolution of Salesforce’s in-house machine learning library, which allowed the Einstein team to deploy custom models for enterprise clients in just hours.
“It’s informed by what our data scientists learned while building Einstein,” Bhaowal explained. Chief among those lessons: Custom-built models beat global, pretrained models. “If you’re using the same model to make predictions for a Fortune 500 company and a mom and pop shop, you’ll have a hard time finding the right pattern.”
Machine learning made easy
TransmogrifAI offers a three-step workflow.
First is feature inference and automated feature selection. It’s a crucial part of model training, as selecting the wrong features could result in an overly optimistic, inaccurate, or biased model.
Using TransmogrifAI, users specify a schema for their data, which the library uses to extract features automatically (phone numbers and zip codes, for example). It also performs statistical tests, automatically cataloging text fields with low cardinality (i.e., a small number of distinct values) and throwing out features with little to no predictive power, those likely to introduce hindsight bias (data leakage, where a feature encodes outcome information that wouldn’t be available at prediction time), and other unwanted signals.
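To give a flavor of what that looks like in practice, here is a minimal Scala sketch of the schema-and-inference step, loosely modeled on the Titanic walkthrough in TransmogrifAI’s documentation; the file path, column names, and exact signatures are illustrative and may vary by library version.

```scala
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types.RealNN
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("TransmogrifAI demo").getOrCreate()

// Load structured training data; the path and schema here are placeholders.
val passengersData = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("titanic.csv")
  .withColumn("survived", col("survived").cast("double")) // response as a numeric column

// Feature inference: walk the DataFrame schema, treat "survived" as the
// response, and turn every other column into a typed predictor feature.
val (survived, predictors) =
  FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
```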
In a demo, Bhaowal showed how TransmogrifAI could quickly isolate features like job titles, emails, and addresses and figure out whether they were predictive. Those that weren’t (salutation, in this case) were discarded automatically. “It’s perfect for dimensionality reduction,” he said, referring to the process of reducing the number of features on which the model is trained.
The next step in TransmogrifAI’s flow is automated feature engineering. Drawing on the feature types extracted in the first step, the library transforms the structured data into numerical vectors, automatically applying type-specific logic such as splitting the country code out of a phone number to check whether the number is valid.
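Continuing the sketch above (with the same caveat about exact names), the engineering step boils down to a single call on the typed features; the sanity check that follows implements the automated feature selection described earlier, and it operates on the engineered vector.

```scala
// Automated feature engineering: apply type-appropriate transformations
// (tokenizing text, pivoting picklists, validating phone numbers, and so on)
// and assemble the results into a single feature vector.
val featureVector = predictors.transmogrify()

// Automated feature validation and selection: run statistical checks against
// the response and drop features with little signal or suspected label leakage.
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
```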
Once TransmogrifAI has extracted features from the dataset, it’s primed to begin automated model training. At this stage, it runs a suite of machine learning algorithms on the data in parallel, automatically selects the best-performing model, and samples the data and recalibrates predictions to compensate for imbalanced classes.
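Taken together with the two calls in the previous sketch, this is roughly where the “three lines of code” claim comes from: one call to engineer features, one to validate and select them, and one to pick a model, with a workflow wrapped around them to do the training. As before, the snippet mirrors the style of the project’s README rather than quoting it verbatim.

```scala
import com.salesforce.op.OpWorkflow
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector

// Automated model selection: train several candidate algorithms in parallel,
// tune them, and keep the best performer on the validation data.
val prediction = BinaryClassificationModelSelector()
  .setInput(survived, checkedFeatures)
  .getOutput()

// Wire the features into a workflow and fit it on the training DataFrame.
val model = new OpWorkflow()
  .setInputDataset(passengersData)
  .setResultFeatures(prediction)
  .train()
```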
Core to TransmogrifAI’s training is what Shubha Nabar, senior director of data science for Salesforce Einstein, calls “model explainability” — transparency about the factors influencing a model’s predictions. “From a trust and data privacy perspective, it’s important that the generated model isn’t a ‘black box’,” she said. “[TransmogrifAI] shows the global effects of each feature.”
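In the sketch above, that transparency would surface through the trained workflow model; the calls below follow the library’s documented model-summary and model-insights workflow, though exact method names may differ across releases.

```scala
// Human-readable summary of the winning model: evaluation metrics plus the
// contribution of each feature to the predictions.
println(model.summaryPretty())

// Programmatic access to the same information (per-feature contributions,
// validation results, and so on) for custom reporting.
val insights = model.modelInsights(prediction)
```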
And that’s just the tip of the iceberg.
TransmogrifAI offers tools that make it easier to adjust the hyperparameters (variables such as sampling rates and filters) that influence and optimize machine learning models. And because the library “types” features with an extensible hierarchy, integrated development environments that support Scala can flag typos and syntax errors and suggest code completions, while users can differentiate between nuanced feature types and primitive ones.
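As one illustration of that type hierarchy, features can also be declared by hand with specific types. The Contact case class and field names below are placeholders, and the builder calls mirror patterns from TransmogrifAI’s documentation rather than quoting it exactly.

```scala
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

// A placeholder record type standing in for a row of CRM data.
case class Contact(name: String, phone: Option[String], email: Option[String], age: Option[Double])

// Rich feature types like Phone and Email refine the primitive Text type, so
// the compiler and IDE can catch mismatches and the library can choose
// type-appropriate transformations, such as phone-number validation.
val phone = FeatureBuilder.Phone[Contact].extract(_.phone.toPhone).asPredictor
val email = FeatureBuilder.Email[Contact].extract(_.email.toEmail).asPredictor
val age   = FeatureBuilder.Real[Contact].extract(_.age.toReal).asPredictor
```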
“[TransmogrifAI] has been transformational for us, [reducing] the average turn-around time for training a performant model to a couple of hours and enabling our data scientists to deploy thousands of models in production with minimal hand-tuning,” Bhaowal said. “The goal of democratizing machine learning can only be achieved through an open exchange of ideas and code, and diverse perspectives from the community will make the technology better for everyone.”
Coincidentally, the public launch of TransmogrifAI comes a day after the open-sourcing of Oracle’s GraphPipe, a tool that makes it easier to deploy machine learning models made by frameworks like Google’s TensorFlow, MXNet, Facebook’s Caffe2, and PyTorch in the cloud.