Guest Post by Basil Alomary
AI has been heralded as the catalyst for a new industrial revolution. While the potential for massive impact is very real, venture investors looking to capitalize on growth ought to spend more time considering the enabling infrastructure.
Although applications are myriad and diverse, from drug discovery to driverless cars, practical adoption in the enterprise has been lackluster. Only 1 in 20 business leaders would describe their companies as “implementing AI widely across the organization.”
An infrastructure-first approach to investing has the potential to yield greater venture returns with a lower risk profile. Looking at the smartphone market, for example, it’s unlikely that an investor in 2005 could have accurately projected that today Google, an internet search engine, would have a mobile business 5x larger than Nokia’s. A broad investment in major chip manufacturers, however, would have captured Qualcomm, whose components have underpinned the rise of mobile technology.
Innovations in AI are exciting, but it is easier to identify and bet on the technologies supporting AI than to predict who will provide the voice assistant of the future. The starting point for identifying these investment opportunities is the deconstruction of the AI workflow—extracting each step in the process, from data acquisition to deployment, and seeking opportunities for efficiency, scale, and access.
What does it mean to operationalize AI?
The process of building and deploying AI tools can be bifurcated into two steps: training and inference. Training is the process by which a deep-learning framework is applied to a dataset. That data needs to be relevant, sufficiently large, and well-labeled to ensure that the system is being trained appropriately. The resulting machine learning models also need to be validated to avoid overfitting to the training data and to maintain a level of generalizability. The inference portion is the application of this model in production and the ongoing monitoring of its efficacy.
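To make the train-then-validate pattern concrete, here is a minimal sketch in Python using scikit-learn; the synthetic dataset and model choice are stand-ins for illustration, not a prescription.

```python
# A minimal sketch of training a model and validating it on held-out
# data to check for overfitting. The dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real, well-labeled business dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Hold out data the model never sees during training so we can
# measure generalizability rather than fit to the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# A large gap between these two scores is the classic sign of overfitting.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```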
The AI/ML Development Lifecycle
Within the aforementioned stages of development, we can envision a more comprehensive development lifecycle. Those stages are as follows: data acquisition, data preparation, training, inference, and implementation. For this evaluation, the three most interesting stages are acquisition, preparation, and implementation, as they’ve arguably garnered the least amount of investor attention.
The training process is dependent on data that is appropriate for the defined business objective. Where do businesses that are developing internal models acquire this data? For some, it is internal customer data. This is particularly relevant for large consumer companies that have been collecting data for some time. Using historical customer data is generally an inexpensive proposition, but it can come with issues of data cleanliness and completeness.
What do companies without historical datasets do to train their models? They either lean on publicly available datasets or purchase data directly. Providers like Narrative are emerging that focus primarily on selling clean, well-labeled datasets explicitly for machine learning use cases. For now, the market remains fragmented, making it difficult for organizations to get the data they need. Whether Narrative will succeed is difficult to assess, but its marketplace-driven approach is likely the model that will win out. Platforms like OpenML and Amazon Datasets have marketplace-like characteristics but are entirely open source. This will always create a barrier for some data providers who are insistent on monetizing their datasets.
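As a sense of how low the barrier to public data has become, here is a minimal sketch of pulling a dataset from OpenML via scikit-learn’s built-in client; the “titanic” dataset is used purely as an illustrative example.

```python
# A minimal sketch of fetching a public dataset from OpenML, one of
# the open platforms mentioned above, using scikit-learn.
from sklearn.datasets import fetch_openml

# "titanic" is an arbitrary, well-known public dataset chosen for
# illustration; any OpenML dataset name/version works the same way.
dataset = fetch_openml("titanic", version=1, as_frame=True)

df = dataset.frame
print(df.shape)              # rows x columns available for training
print(list(df.columns[:5]))  # a peek at the available features
```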
Data preparation, irrespective of whether the data is internal to an organization or purchased, is critical to training effective machine learning models. As described in Data Mining: Practical Machine Learning Tools and Techniques, “Preparation involves preprocessing the raw data so that machine learning algorithms can produce a model—ideally, a structural description of the information that is implicit in the data.”
A dataset of a thousand traffic images may need a particular characteristic, say the presence of a “Stop” sign, noted for each image, but verifying every image is an enormous task. This task, known as data labeling, becomes even more monumental with larger datasets.
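To illustrate what “well-labeled” means in practice, here is a hypothetical, simplified annotation record for one traffic image; real labeling platforms export richer schemas, but the shape is similar, and every field shown is an assumption for illustration.

```python
# A hypothetical annotation record for a single traffic image.
# The schema is illustrative, not any particular platform's format.
import json

annotation = {
    "image_id": "traffic_0001.jpg",
    "labels": [
        {
            "class": "stop_sign",
            # Bounding box as [x, y, width, height] in pixels.
            "bbox": [412, 128, 64, 64],
            # Who produced the label and how confident they were.
            "source": "human_annotator",
            "confidence": 0.98,
        }
    ],
}

print(json.dumps(annotation, indent=2))
```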
Today, platforms that facilitate data labeling can support several parts of the process: raw human input to label data, facilitation of collaboration, data management, and a layer of automation that accelerates the labeling process. Several competitors in the space are looking to manage data preparation end-to-end and are starting to reach into other parts of the process as well. Labelbox, for example, recently raised a $25M Series B from Andreessen Horowitz, and companies like Hive, CloudFactory, and Scale AI are all competing in the space.
The implementation component of the AI development lifecycle is complex: it encompasses not only the deployment of the model into the real world but also its ongoing evaluation. Doing so requires building a data pipeline that can handle continued training, scaling and managing computing resources, implementing version control, and integrating monitoring tools.
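Two of those concerns, version control and monitoring hooks, can be sketched in a few lines; the names here (VersionedModel, predict_and_log) are illustrative assumptions, not a real library’s API.

```python
# A minimal sketch of wrapping a trained model with version metadata
# and prediction logging, two deployment concerns named above.
import logging
from dataclasses import dataclass

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference")

@dataclass
class VersionedModel:
    model: LogisticRegression
    version: str  # which model artifact is actually serving traffic

    def predict_and_log(self, X):
        preds = self.model.predict(X)
        # Logged predictions feed the ongoing-evaluation step.
        log.info(
            "model=%s n=%d positives=%d",
            self.version, len(preds), int(preds.sum()),
        )
        return preds

X, y = make_classification(n_samples=200, random_state=0)
wrapped = VersionedModel(
    model=LogisticRegression(max_iter=1000).fit(X, y),
    version="v1.0.3",
)
wrapped.predict_and_log(X[:25])
```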
Amazon and Google have recognized this need to support enterprises looking to deploy AI-driven applications and have developed ecosystems of relevant tools. With platforms like SageMaker, Amazon is not only providing a tool to help facilitate deployment but an entire managed service that includes human intervention to monitor deployed models. Most companies in the space, like Algorithmia and Dataiku, are likewise looking to tackle things end-to-end. Given the nascent nature of the space, however, there will likely be an unbundling.
Looking Forward To Opportunities in the AI Infrastructure Space
Prudent investors should seek innovative companies that are driving this unbundling and focusing on aspects of the deployment process that are especially painful for businesses. This could be as simple as more accurate monitoring of deployed models, something that may be easily overlooked but provides tremendous value to customers.
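As one example of what that monitoring might look like, here is a minimal sketch comparing the distribution of live prediction scores against a training-time baseline to flag drift; the beta-distributed scores and the 0.05 threshold are illustrative assumptions.

```python
# A minimal drift check: compare production score distributions to a
# validation-time baseline with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, size=1000)  # scores at validation time
live_scores = rng.beta(3, 4, size=1000)      # scores from production traffic

stat, p_value = ks_2samp(baseline_scores, live_scores)
if p_value < 0.05:  # illustrative threshold; tune per use case
    print(f"Possible drift (KS={stat:.3f}, p={p_value:.4f}); review the model.")
else:
    print("Score distributions look consistent with the baseline.")
```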
As new teams and processes emerge to develop and deploy AI more widely, there will need to be platforms that support operationalizing them. These platforms represent a phenomenal opportunity, and investors should continue to deconstruct the AI/ML development lifecycle to look for them.
Basil Alomary is a second-year MBA candidate at Columbia Business School and MBA Associate at Primary Venture Partners. His background and experience are in early-stage SaaS, both as an operator and investor.