As AI model intelligence peaks, its reliance on complex, human-curated data is only deepening.
They started with microtasks such as transcribing audio files, marking tick boxes, translating language and labelling objects in images. Now, data annotators are correcting software code, checking financial statements and analysing diagnostic reports, as the training needs of artificial intelligence models become more complex.
Data annotation, or simply data labelling, is a crucial and foundational step for building high-quality datasets to train AI models, enhance accuracy, curtail hallucinations and build safety guardrails against inappropriate or harmful content. And India is fast emerging as a hub for data annotation services, with flexible workers, mid-tier business analysts and even skilled data engineers, auditors, radiologists and lawyers contributing to building high-quality datasets.
“Honestly, I think we need to retire the term ‘data labelling’,” said Jonathan Siddharth, founder of Palo Alto-based talent and AI tools company Turing. “It’s like calling a smartphone a ‘portable telephone’.”
“What we’re doing now is fundamentally different. We’re not tagging cats and dogs; we’re orchestrating teams of Olympiad-level talent to solve highly complex problems across industries,” he said. AI models have got so smart that sometimes you need a physicist, a software engineer, and a data scientist working together just to generate data that challenges them, he explained.
Harshul Arora, founder and CEO of early-stage startup Macgence, said his company is focussing on curating custom datasets for AI/ML models and agents. “Businesses now have custom data sourcing needs which capture linguistic and cultural nuances. These datasets are not available on open libraries like Hugging Face,” he said.
Riding the growth wave
The global market for data annotation is likely to expand from about $6.5 billion in 2025 to nearly $20 billion by 2030, growing at about 25–30% each year, according to staffing firm TeamLease Digital. In India, the market was worth $80 million in 2023 and is expected to reach nearly $500 million by 2030, growing at almost 30% each year, it said.
This is reflected in the growth of the segment's workforce, from 20,000 in 2022 to 70,000 currently. The workforce includes annotators, quality controllers and project managers, who work in startups, IT services and crowdsourcing platforms.
“Data annotation has grown more complex with the rise of LLMs, leading to the emergence of specialised, higher-paying roles for domain-specific tasks,” said Kapil Joshi, CEO of Quess IT Staffing, adding that some of its clients have grown 50% year-on-year. With this growth, the sector will soon witness a talent scarcity, said TeamLease Digital CEO Neeti Sharma. “By 2026, the industry could face a shortage of 40–50% in skilled professionals.”
“As models evolve, data demands will shift: certain types of data may require lower volumes but others will rapidly expand,” said Ryan Kolln, CEO of Appen, an Australia-based company that has delivered over 15,000 AI data projects, including LLM fine-tuning, evaluation, red teaming, and multimodal annotation. “A good example of this is LLM work, where elementary math question data is reducing, but data is still growing in demand for more complex STEM (science, technology, engineering and mathematics) problems,” he said.
The sector’s importance is underscored by Meta’s recent $14.3 billion deal to acquire a 49% stake in Scale AI, valuing the data company at $29 billion. This has opened a multi-million-dollar opportunity for global companies like Turing and Appen, as tech giants OpenAI, Google and Microsoft have reportedly terminated their contracts with Scale. Turing’s Siddharth said the deal validates that “data is as strategic as compute in the race to AGI (artificial general intelligence), and signals that the scale of investment here will rival or even exceed billions annually across frontier labs”. In recent weeks, Turing has added potential contracts worth $50 million, Time reported.
The India advantage
Data companies have long depended on India’s talent and scale for servicing global projects. “The depth of technical expertise, from IIT grads to domain-specific PhDs in math, physics and engineering, is extraordinary. And it’s evolving in sync with what AI needs: not just coding talent, but frontier minds who can help push the limits of reasoning, multimodality and agentic workflows,” said Siddharth of Turing, 40% of whose workforce is based in India.
He added that data labs need the best minds to compete, “not just recycle the same talent pool in Silicon Valley. When a physicist in Bengaluru helps train a model that might cure diseases, or an engineer in Pune improves an AI that could revolutionise education, that’s the democratisation of both intelligence and opportunity”.
Appen’s Kolln pointed out that the Indian education system builds strong logical thinking and problem-solving skills, given its emphasis on mathematics and science. The company has a pool of 50,000 contributors from India.
Hardik, founder and CEO of Indika AI, said: “Over the past three years, we’ve seen strong global demand for multilingual, domain-specific data infrastructure which translated into 5x top-line growth for us.” The company’s freelance platform, Flexibench, has 70,000 registered contributors, 5–10% of whom are working actively at any given time, he added.