Tag

Training Data

All articles tagged with #training data

Study Finds ChatGPT Ranks States by Stereotypes, Revealing Geographic Biases
technology · 10 hours ago

A study by researchers at Oxford and the University of Kentucky shows that ChatGPT can stereotype U.S. states when forced to choose between pairs, ranking Massachusetts as the smartest and Louisiana as the smelliest, and placing Mississippi and Kentucky among the least favorably ranked, with the broader biases tied to training data and societal narratives. OpenAI says newer models and prompts mitigate these issues, but the researchers warn that such biases can still influence real‑world perceptions and decisions.

Your chores could power the next home robot
ai · 1 day ago

AI startups are offering free home cleaning in exchange for filming everyday chores, gathering the real-world data needed to train robots. Approaches range from gig workers capturing footage to egocentric camera hats and staged data farms, and they raise privacy questions as firms pursue practical training data for physical AI.

Negation Neglect: LLMs Persistently Believe Fabricated Facts Despite Warnings
technology · 2 days ago

A new preprint shows that large language models (including GPT-4.1) come to believe, and keep believing, false claims embedded in their training data: belief rates rise from about 2.5% to over 90% after fine-tuning on obviously false statements. Even when the falsehoods are explicitly negated in the training material, belief rates stay high (around 88%), and repeating the negations produces similar results. The study finds that the only effective mitigation is to place the negation in the same sentence as the false claim; in-context warnings during a chat, by contrast, are more likely to prompt the model to acknowledge the fabrication. The work highlights how the structure of training data can seed persistent falsehoods in LLMs and can inform better data curation.

AI Labs Seek Improv Actors to Teach Machines Human Emotion
ai · 2 months ago

Handshake AI and other data-labeling firms are recruiting improv performers to help leading AI labs train their models, aiming to teach multimodal AI to recognize and express human emotion in unscripted scenes. The gigs pay around $74 per hour and are pitched as flexible, but workers warn that pay can dwindle and schedules can be unstable, raising concerns about the impact on performers’ careers as labs push toward more humanlike AI.

Balancing openness and safety in AI biology data
technology · 3 months ago

More than 100 researchers back a framework that would treat certain biological data like sensitive health records. They argue that most data should remain open, while a narrow subset that could enable misuse (such as data linking viral genetics to real-world traits) needs protection. They warn that training AI models on such data could lower the barrier to designing dangerous pathogens; legitimate researchers should still have access, but the data should not be uploaded anonymously or be browsable on the open web. The aim is to balance scientific progress with biosecurity, with restrictions reassessed regularly as the science evolves, to prevent worst-case scenarios.

Study Finds Major AI Models Copy Verbatim Copyrighted Text, Challenging the “Learning” Claim
technology · 4 months ago

Stanford and Yale researchers tested four major LLMs—OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet—and found they can reproduce lengthy, copyrighted passages with high accuracy (Claude 3.7 Sonnet near-verbatim ~95.8%; Gemini 2.5 Pro ~76.8% on Harry Potter; Claude 3.7 Sonnet >94% on Orwell’s 1984), suggesting these models may store or copy training data rather than simply learning patterns. Some reproductions required jailbreak-style prompts (Best-of-N), underscoring potential legal liabilities as copyright lawsuits proceed and the industry debates what counts as “learning.”

Anthropic to Pay $1.5 Billion to Resolve Copyright Lawsuit with Book Authors
technology · 8 months ago

Anthropic has agreed to pay at least $1.5 billion to authors whose pirated works it used to train its AI models, setting a precedent for AI companies to compensate rights holders for illegally obtained content. The settlement establishes a fund valuing each pirated book at $3,000 and calls for the destruction of the pirated materials, highlighting the growing legal risks for AI firms.