Tag

Training Data

All articles tagged with #training data

AI Labs Seek Improv Actors to Teach Machines Human Emotion
ai · 27 days ago

Handshake AI and other data-labeling firms are recruiting improv performers to help leading AI labs train multimodal models to recognize and express human emotion in unscripted scenes. The gigs pay around $74 per hour and are pitched as flexible, but workers warn that pay can dwindle and schedules can be unstable, raising concerns about the impact on performers’ careers as labs push toward more humanlike AI.

Balancing openness and safety in AI biology data
technology · 1 month ago

More than 100 researchers back a framework that would treat certain biological data like sensitive health records. They argue that most data should remain open, while a narrow subset that could enable misuse (such as data linking viral genetics to real-world traits) needs protection. They warn that training AI models on such data could lower the barrier to designing dangerous pathogens; legitimate researchers should retain access, but the data should not be uploaded anonymously or be browsable on the open web. The aim is to balance scientific progress with biosecurity, and the framework calls for regular reassessment of restrictions as the science evolves.

Study Finds Major AI Models Copy Verbatim Copyrighted Text, Challenging the “Learning” Claim
technology · 2 months ago

Stanford and Yale researchers tested four major LLMs (OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet) and found they can reproduce lengthy copyrighted passages with high accuracy (Claude 3.7 Sonnet near-verbatim ~95.8%; Gemini 2.5 Pro ~76.8% on Harry Potter; Claude 3.7 Sonnet >94% on Orwell’s 1984), suggesting these models may store or copy training data rather than simply learning patterns. Some reproductions required jailbreak-style prompts (Best-of-N). The findings underscore potential legal liabilities as copyright lawsuits proceed and the industry debates what counts as “learning.”

Anthropic to Pay $1.5 Billion to Resolve Copyright Lawsuit with Book Authors
technology · 7 months ago

Anthropic has agreed to pay at least $1.5 billion to authors whose pirated works it used to train its AI models, setting a precedent for AI companies compensating for illegal content use. The settlement establishes a fund that values each pirated book at $3,000 and requires the pirated materials to be destroyed, highlighting the growing legal risks for AI firms.

"AI Giants Struggle with Data Depletion: The Quest for More Training Data"
technology · 2 years ago

AI companies are facing a shortage of training data as they continue to build larger models, and are exploring alternative sources such as publicly available video transcripts and synthetic data. Some companies are considering controversial methods like training on transcriptions of public YouTube videos, while others are working on higher-quality synthetic data. Concerns have been raised that AI could run out of data, but researchers believe that breakthroughs could address the issue. The solution may also involve reevaluating the pursuit of ever-larger models, given environmental and resource concerns.

"Unveiling OpenAI's Groundbreaking Sora AI Videos and Training Data Mystery"
artificial-intelligence · 2 years ago

OpenAI continues to tease its upcoming AI video generator, Sora, with text-to-video clips that are impressing viewers, and plans to release it later this year with sound and metadata. The company is giving early access to some individuals in the film industry for testing. However, the training data used for Sora remains undisclosed: OpenAI's CTO Mira Murati was vague about its sources in a recent interview with The Wall Street Journal.