PromptZone - Leading AI Community for Prompt Engineering and AI Enthusiasts

Elena Rodriguez

Startups Selling Data to AI Firms

Shuttered startups are auctioning off their archived Slack messages and emails to AI companies, providing a new source of training data for large language models.

This article was inspired by "Shuttered startups are selling old Slack chats and emails to AI companies" from Hacker News.

Read the original source.

How the Sales Unfold

Startups that have closed down often hold vast troves of internal communications, including Slack chats and emails totaling millions of messages. These archives are sold through brokers or directly to AI firms, which use them to fine-tune models for better conversational accuracy. For instance, one broker reported handling deals worth $50,000 to $500,000 per dataset, depending on volume and industry relevance. The practice emerged as a way for failed companies to recoup losses, with the first known cases appearing in 2023.

Bottom line: This creates a marketplace for real-world corporate data, potentially accelerating AI training by providing authentic, context-rich examples.

What the HN Community Says

The Hacker News post received 27 points and 6 comments, indicating moderate interest. Comments highlighted concerns about data privacy risks, with one user noting that these sales could expose sensitive information to unintended uses. Others praised it as an efficient recycling of digital assets, estimating that such datasets might contain up to 10 terabytes of unstructured text per startup. Feedback also included questions on legal compliance, such as adherence to GDPR regulations.

Aspect | Positive Views | Concerns Raised
Efficiency | Recycles data for AI progress | Potential breaches of user privacy
Value | Datasets fetch $50K+ | Lacks transparency in sales
Frequency | Growing trend since 2023 | Only 6 comments suggest limited discussion

Ethical Implications for AI Development

This trend addresses a key challenge in AI: the need for diverse, high-quality training data, which traditional sources like web scrapes often lack. For example, AI companies report that corporate communications improve model performance on professional tasks by 15-20% in benchmarks. However, it also raises ethical flags. HN commenters pointed to potential violations of employee consent, with one estimating that 40% of such data includes personal identifiers.

Technical Context

These sales typically involve anonymizing the data before transfer, but the effectiveness of that anonymization varies. AI firms then integrate the data using fine-tuning scripts on platforms such as Hugging Face, and any open model they fine-tune remains subject to its own license, such as Apache 2.0.
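
To make the anonymization step concrete, here is a minimal sketch of what a pre-transfer redaction pass might look like. It is an illustration only, not a description of what any broker or AI firm actually runs. The message shape (`user` and `text` keys), the placeholder tokens, and the function names `redact` and `anonymize_messages` are all assumptions made for this example, and the `<@U…>` mention pattern assumes Slack-style user mentions in the exported text.

```python
import re

# Hypothetical patterns for this sketch; real pipelines need far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
MENTION_RE = re.compile(r"<@[A-Z0-9]+>")  # assumes Slack-style user mentions


def redact(text: str) -> str:
    """Replace obvious identifiers in a message body with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = MENTION_RE.sub("[USER]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def anonymize_messages(messages: list[dict]) -> list[dict]:
    """Swap each sender for a stable pseudonym and redact the message text.

    `messages` is assumed to be a list of {"user": ..., "text": ...} dicts.
    """
    pseudonyms: dict[str, str] = {}
    cleaned = []
    for msg in messages:
        # The same sender always maps to the same alias within one dataset.
        alias = pseudonyms.setdefault(msg["user"], f"user_{len(pseudonyms) + 1}")
        cleaned.append({"user": alias, "text": redact(msg["text"])})
    return cleaned


if __name__ == "__main__":
    sample = [
        {"user": "jane", "text": "Ping jane.doe@example.com or call +1 (555) 123-4567"},
        {"user": "raj", "text": "Thanks <@U12345>, will do."},
        {"user": "jane", "text": "Jane here again, no identifiers in this one."},
    ]
    for row in anonymize_messages(sample):
        print(row)
```

Note that the last sample message still contains the sender's first name in free text, which pattern matching like this does not catch. That gap is one reason the effectiveness of anonymization varies in practice.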

Bottom line: While providing valuable resources, this practice could lead to stricter regulations if privacy issues escalate.

In summary, as AI's demand for authentic data grows, expect more startups to enter this market, potentially standardizing data sales protocols to mitigate risks by 2025.
