From Web2 to Web3: Why I'm Bullish on the AI Track

Author: Zixi.eth, Matrix Partners China investor. Source: X (formerly Twitter) @Zixi41620514

Recently, I have begun to focus on the Web2/Web3 AI track: the open-source model community within the global model track, the data track, the various middleware serving large models (such as full-process services that turn foundation models into industry models), and applications. We welcome entrepreneurs of all kinds to get in touch; we believe AI will be a long-term track.

In this first issue, I will cover the data labeling industry within the data track, which we have recently positioned in and which has been one of my most satisfying targets this year.

AI development can be divided into data preparation (data collection, cleaning, annotation, and augmentation) and algorithm development (model construction, training, tuning, and deployment). Because AI in the new era has diverse data needs, such as multimodality, high precision, and strong customization, data preparation remains highly dependent on human labor, and smoother interaction between AI and people is needed to raise efficiency.

Data labeling refers to identifying and distinguishing the feature elements in the data samples required for model training. Since AI development is still in the supervised-learning stage, algorithm models represented by deep learning learn and verify the information within data, and the logic between data points, on the basis of labeled features. Annotation is therefore indispensable and is one of the core tasks of data preparation, and indeed of AI project development as a whole. Like the rest of the data-preparation workflow, data labeling is highly labor-dependent: lengthy work cycles and huge labor costs have become one of the main factors restricting the development of the AI industry. These supply-side pain points have generated market demand for automation tools and driven the development and large-scale application of intelligent data annotation technology.

Figure 1: From data acquisition to AI-usable datasets

![hJQWkT4AU2PQ3QOm8pPJJBmxxDyRyO7j0J6qvdlU.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-aef9208402-dd1a6f-cd5cc0.webp "7135831")

At present, intelligent driving, the largest downstream application of data annotation, still requires large numbers of humans to label scenarios such as cats and dogs, telephone poles, and strollers. Scale AI, for example, is an important data provider to OpenAI, and has established its own data annotation studios in third-world countries around the world to assist OpenAI with text and image annotation.

However, with the advancement of AI, the share of pre-annotation in the workflow is gradually increasing. In the early days, data annotation was done mostly by hand to build and accumulate machine learning datasets. Although relatively inefficient and costly, manual annotation gives the machine a great advantage as long as the labels are accurate. Over time, the center of gravity of manual annotation gradually shifted from the United States to third-world countries such as Venezuela and the Philippines to reduce costs.

As models develop, the accuracy of automated annotation improves, and models can be used to assist manual annotation: for example, a model preprocesses the data before sending it to human annotators, or the labels produced by an automated model are manually reviewed and corrected. Compared with purely manual annotation, AI-assisted annotation is much faster. Currently, the world's largest data labeling companies, such as Scale AI, are working to reduce the proportion of human involvement in the labeling process.
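The assisted workflow described above, in which a model pre-labels data and only uncertain items reach humans, can be sketched as follows. This is an illustrative toy, not any company's actual pipeline: the model, items, and confidence threshold are all invented for the example.

```python
def model_predict(item):
    # Stand-in for a real model: returns (label, confidence).
    # Faked with a lookup table purely for demonstration.
    fake_predictions = {
        "img_cat": ("cat", 0.97),
        "img_pole": ("utility_pole", 0.55),
        "img_stroller": ("stroller", 0.91),
    }
    return fake_predictions.get(item, ("unknown", 0.0))

def pre_annotate(items, threshold=0.9):
    """Split items into auto-accepted labels and a human review queue."""
    auto_labeled, review_queue = {}, []
    for item in items:
        label, confidence = model_predict(item)
        if confidence >= threshold:
            auto_labeled[item] = label          # accepted as-is
        else:
            review_queue.append((item, label))  # human corrects/confirms
    return auto_labeled, review_queue

auto, queue = pre_annotate(["img_cat", "img_pole", "img_stroller"])
# High-confidence items are auto-labeled; img_pole goes to a human.
```

The confidence threshold is the lever companies like Scale AI turn: as model accuracy rises, more items clear the bar automatically and the human share of the work shrinks.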

Although pre-annotation has achieved good results in computer vision, in the new era of language and large models it remains immature and cannot fully replace human labor, for several reasons:

  1. Low accuracy, especially on complex tasks and edge cases.
  2. Sample bias and model hallucination.
  3. Some verticals require large datasets annotated by subject-matter experts.
  4. Pre-annotation scales poorly: for low-resource languages or uncommon scenarios, the cost is high and the quality poor, so the work still has to be done manually.

In summary, pre-annotation will not completely replace manual annotation in the short term; the two will coexist. Even as the share of manual annotation decreases, human auditors will still be needed to review labels during the annotation process.

Figure: The data labeling process under pre-annotation

![KZJdLcjAdtw08bJNZ6Z0ZURmCjqKjsv9LM9U4HrO.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-6c94f3b716-dd1a6f-cd5cc0.webp "7135843")

The data annotation industry is not new; it began to emerge around 2017-2018 with the rise of intelligent driving. The chart below shows the predicted market size of data labeling providers in China. It is worth noting that the US data labeling market is roughly 3-5 times the size of China's.

The data labeling industry is a relatively fragmented market. Rather than a field with extremely high technical barriers, it is one where technical, human, and organizational-management barriers each account for roughly a third. Core competitiveness in this field comes down to:

  1. Price
  2. Quality
  3. Expertise and knowledge coverage (diversity)
  4. Speed

Price is obvious: everyone needs large amounts of cheap data. Price pressure drives a form of geographic arbitrage: in the developed United States, it may cost $1 in wages to complete one data label, while in China it costs only $0.5, and in the Philippines as little as $0.1. One market solution is therefore to take orders in first-world countries and fulfill them with workers recruited in third-world countries through directly operated studios.
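The arbitrage arithmetic can be made concrete with the per-label figures quoted above (these are the article's illustrative numbers, not market data). Costs are kept in integer cents to avoid floating-point drift:

```python
# Per-label wage cost in cents, using the article's illustrative figures.
COST_PER_LABEL_CENTS = {"US": 100, "China": 50, "Philippines": 10}

def labeling_cost_usd(n_labels, region):
    """Total wage cost in dollars for labeling n_labels items in a region."""
    return n_labels * COST_PER_LABEL_CENTS[region] / 100

# A 1M-label job costs $1,000,000 in the US but $100,000 in the Philippines:
us_cost = labeling_cost_usd(1_000_000, "US")           # 1000000.0
ph_cost = labeling_cost_usd(1_000_000, "Philippines")  # 100000.0
```

A tenfold wage gap on a million-label job is the margin that funds the "take orders in the first world, staff studios in the third world" model.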

Data quality is also easy to understand: both large models and intelligent driving require high-quality data, and if the data fed to a model is poor, the model's performance suffers. One effective approach is to have a model pre-label the raw data, have humans annotate on top of that, and then continuously apply reinforcement learning from human feedback to improve labeling quality. Alternatively, the team must understand the downstream customer's labeling process well enough to develop standard operating procedures (SOPs), so that annotators can label to the SOP and raise quality.
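One standard way to quantify the annotation quality discussed here, though not named in the text, is inter-annotator agreement, for example Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

a = ["cat", "dog", "cat", "cat", "dog", "cat"]
b = ["cat", "dog", "dog", "cat", "dog", "cat"]
kappa = cohens_kappa(a, b)  # 5/6 observed agreement corrects down to ~0.67
```

In practice an SOP-driven quality process would track a metric like this per annotator and per task, flagging low-kappa batches for auditor review.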

But how do you understand expertise and knowledge coverage? Let’s take three examples:

  1. Expertise is quite a challenge for general-purpose models. Annotating a large text model may be relatively easy, but you need people who can annotate multiple languages such as Chinese, English, French, German, Russian, and Arabic, and recruiting and managing that many distributed annotators on a global scale is a challenge for any data labeling company.

  2. Consider an AI application startup building voicebots or digital humans. Startups often lack the time, manpower, and money to set up an in-house annotation team, so they need an outsourced team to label Chinese accents (Sichuan, Cantonese, Shanghai, Northeastern, etc.) as well as English accents (North American, British, Singaporean). Finding a studio in the market that can handle all of these is very difficult, and with direct sales or subcontracting, it may take one to two months from receiving the order to finishing recruitment, which seriously hurts supply efficiency.

  3. Consider a more niche area: a startup focused on legal models needs a lot of legal data annotated. Law has quite high professional requirements, and the startup must find a data annotation provider that meets the following criteria: 1. at least a dozen people who understand law, possibly covering Chinese law, Hong Kong law, American law, and so on; 2. annotators who understand both Chinese and English; 3. costs that are not too high, since practicing lawyers may be reluctant to do labeling work at their usual salaries. The current solution in such niches is usually to recruit student interns internally to do the annotation; for the direct-sales and subcontracting models, such subdivided tracks remain quite difficult to serve.

Thus, the major players in the market fall into three categories: 1. in-house teams at large companies (e.g. Baidu Crowdsourcing); 2. startups with a direct-sales/subcontracting model (analyzed below); 3. small and medium-sized data annotation studios.

Chart: The size of the data market in China’s AI market

![F1zEq2z7zALsirAXyNV94uPmTLqwewBYopHlxyI5.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-edbb9fdd9b-dd1a6f-cd5cc0.webp "7135849")

Before we dive in, let’s take a look at the current leading startups in the space:

  1. Scale AI: Scale AI's US business covers four areas: data annotation; management and evaluation (controlling the quality of annotated data and improving annotation efficiency); automation (assisted annotation to improve efficiency); and data synthesis (as models proliferate and real data runs short, data must be synthesized automatically to feed them; we will cover the synthetic data track later). Scale AI initially focused on autonomous driving annotation: two years ago, 80-90% of the company's orders came from autonomous driving (2D, 3D, LiDAR, etc.), though that share has fallen in recent years. Its order flow tracks supplier-side industry trends: government, e-commerce, robotics, large models, and other fields have grown rapidly in recent years, and the team's keen sense for industry trends lets it maintain a high market share in each segment. In addition, Scale AI has launched its own Model-as-a-Service offering, helping customers finetune, host, and deploy models.

There are two types of charging models:

  • Consumption-based: For example, Scale Image starts at 2 cents per image and 6 cents per label; Scale Video starts at 13 cents per video frame and 3 cents per label; Scale Text starts at 5 cents per job and 3 cents per label; and Scale Document AI starts at 2 cents per job and 7 cents per label.

  • Project-based: priced by the volume of data in the contract and similar terms, this is essentially project revenue, with contract values ranging from hundreds of thousands to tens of millions of dollars.
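The consumption-based rates above can be turned into a rough cost estimator. This is only a sketch using the starting prices quoted in the text; actual Scale AI pricing varies by task and volume:

```python
# Starting rates in cents, as quoted in the text: (per unit, per label).
RATES = {
    "image":    (2, 6),   # per image, per label
    "video":    (13, 3),  # per video frame, per label
    "text":     (5, 3),   # per job, per label
    "document": (2, 7),   # per job, per label
}

def estimate_cost_usd(task, n_units, n_labels):
    """Estimated dollar cost of a consumption-based labeling job."""
    per_unit, per_label = RATES[task]
    return (n_units * per_unit + n_labels * per_label) / 100

# e.g. 10,000 images averaging 3 labels each:
cost = estimate_cost_usd("image", 10_000, 30_000)  # $2,000.0
```

Even at these entry-level rates, a modest 10k-image job runs to thousands of dollars, which is why the project-based contracts mentioned above can reach seven or eight figures.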

With projected revenue of $290 million in 2022 and a current valuation of $7 billion, Scale AI is the world's largest data annotation company. Its investor roster is also star-studded.

  2. Haitian AAC: China's Haitian AAC also plays an important role in data annotation. The company has rich experience in data annotation, data cleaning, data analysis, and related services, though detailed information on its business model, pricing, and financing is not yet public.

  3. Appen: Australia's Appen is another of the world's leading data annotation companies. Like Scale AI, Appen provides data annotation, voice data collection, translation, and other services, with a large pool of annotators around the world supplying customers with high-quality labeled data. Appen's detailed business model and financing are also worth further study.

![xa4j0mwuoOYQ00imQe68w3BjAnA4g95Ujfgfyyt2.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-2e082f1e24-dd1a6f-cd5cc0.webp "7135866")

![a7IUQulVILcdWIgIDUEaI03FMCYU7v9dD8na50Z7.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-d87ea871ea-dd1a6f-cd5cc0.webp "7135867")

These three companies occupy significant positions in the global data annotation space, leading the field in the United States, China, and Australia respectively. Before we dive into startups' business models and market competition, a look at these leaders gives a more comprehensive picture of the industry as a whole.

Haitian AAC is an A-share listed company, but it is not exactly a data labeling company. Rather than building its own annotation team, Haitian is essentially a technical service provider that outsources orders to various studios. Its expansion in China rests on: 1. deep accumulation in speech annotation, covering more than 190 languages (70-80% of revenue); 2. economies of scale; 3. strong internationalization capability. In China, the data labeling industry is still wild and early-stage: highly fragmented, disorderly, and lacking industry standards and norms.

![6iWBdOeecyfMWXlJNqoFBPfQ2uR8DBFnFMCq1Lzp.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-5eb8a04957-dd1a6f-cd5cc0.webp "7135868")

![wLae6HBKOMqrzEuPewUKwzonMRcOT3qGYE3naIit.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-557bc22bf7-dd1a6f-cd5cc0.webp "7135871")

We can compare the business models of Appen and Haitian to see how the direct-sales and outsourcing models differ, including in gross margin.
Figure: Direct/Outsourcing Business Models…

![TQDXGwKEyjSFDYrMViQMs5PBpW3j7KXs4wMmU3ne.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-90760efac6-dd1a6f-cd5cc0.webp "7135872")

![RUb44Sii8E9I8kPM9J4yiUFtE7U7t52KUh1s6jd1.png](https://img-cdn.gateio.im/webp-social/moments-40baef27dd-bc79aa85ac-dd1a6f-cd5cc0.webp "7135873")

After all this foreshadowing, attentive readers may be wondering about our title: the article has not yet mentioned blockchain at all, so how does it reshape data annotation?

The future of AI should be open and sovereign: data, compute, and models alike should be universally and openly accessible to society, on the basis of guaranteed quality and efficiency. All participants who help advance AI should have ownership of their contributions and outputs, along with a reasonable distribution of the resulting rewards.

Quest Labs, a company we recently invested in, aims to redefine the relationship between AI and people in the new era, using AI and blockchain technology to attack the industry's existing pain points. Data services, a necessary shovel upstream in the AI industry chain, are the first problem Quest wants to solve: use AI to raise data production efficiency, and use blockchain to redefine the economic model and value capture of public datasets in the new era. The two reinforce each other to continuously produce high-value data and improve the capability and understanding of AI annotators.

  1. AI and human collaborative intelligence:
  • An intelligent human-in-the-loop, AI-centered infrastructure that enables and incentivizes human teams to interact smoothly with co-pilot models, providing high-precision data and iteratively improving its quality to generate high-value data across the lifecycle.
  • A decentralized marketplace, powered by the Humans Ops Tool, that maximizes the efficiency of decentralized workforce management and optimizes collaboration and communication across a global network of distributed teams.
  2. Data disclosure, privacy, and ownership:
  • The platform deeply incentivizes user traffic and stickiness through paid cash flow and tokens, continuously stimulating a data flywheel: it captures behavioral and historical data from both supply and demand sides so the two keep learning from each other, while algorithms recommend and formulate data-demand frameworks to ensure future commercial value (mining hard domains) across a large number of vertical scenarios. All labeling participants can provide datasets in advance to be called up and commercialized, receiving cash flow and token rewards, ultimately forming a valuable open AI data network for the new era.
  • Data encryption and privacy protection: ZK and FHE are used to encrypt user data for processing and storage.
  • Blockchain is used to trace and verify participants' ownership of data, including different outputs such as collection and annotation, and their corresponding value.
  3. A new economic model:
  • Through a Meituan-style global AI data service platform that automatically matches supply and demand, the industry shifts from a centralized planned economy to a market economy.
  • Blockchain technology guarantees credible reputation, and digital-currency settlement optimizes payments; the supply-side labor pool can expand without limit while matching remains precise, so the right people do the right tasks, efficiently and with quality. Because data labeling work overlaps heavily with low-income populations, it also delivers employment and financial inclusion along the way.
  4. Tokens are given to users to incentivize continuous learning and high-quality services and outputs, and at the same time to incentivize users to provide high-quality, effective feedback that optimizes the platform's models, raising the efficiency and productivity of the entire pipeline (humans and AI continuously learning from each other).
  • Through tokens, benefits are distributed and value is captured reasonably according to POPW, lowering CAC and improving retention.

From a Web2 perspective, this is a distribution platform for data annotation, a bit like Didi or Meituan Waimai. From a Web3 perspective, it is an Axie Infinity + YGG with real cash flow. In the 2021 bull market, the combination of Axie and YGG brought a considerable number of third-world users into Web3, and gaming guilds of this type fed a very large number of third-world families during the pandemic, especially in the Philippines. The market also rewarded Axie and YGG handsomely; they were very interesting alphas. As investors bridging Web2 and Web3, we are very willing to support projects and teams that use blockchain technology to contribute to real business, and we look forward to the team's future performance. This is also one of the few directions where we see Web3 technology giving wings to a Web2 business.
