18 items with this tag.

12/23/2025: Apply for Alignment Mentorship From TurnTrout and Alex Cloud (tags: mats program, shard theory, community, AI)
10/30/2024: Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake (tags: instrumental convergence, shard theory, AI)
10/13/2023: Paper: Understanding and Controlling a Maze-Solving Policy Network (tags: activation engineering, shard theory, AI)
5/13/2023: Steering GPT-2-XL by Adding an Activation Vector (tags: activation engineering, shard theory, mats program, AI)
4/20/2023: Behavioural Statistics for a Maze-Solving Agent (tags: mats program, shard theory, AI)
4/1/2023: Definitive Confirmation of Shard Theory (tags: shard theory, humor, AI)
3/11/2023: Understanding and Controlling a Maze-Solving Policy Network (tags: activation engineering, mats program, shard theory, AI)
3/1/2023: Predictions for Shard Theory Mechanistic Interpretability Results (tags: mats program, shard theory, rationality, AI)
12/17/2022: Positive Values Seem More Robust and Lasting than Prohibitions (tags: shard theory, human values, AI)
12/2/2022: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems (tags: shard theory, critique, AI)
11/29/2022: Alignment Allows “Non-Robust” Decision-Influences and Doesn’t Require Robust Grading (tags: shard theory, human values, AI)
10/6/2022: A Shot at the Diamond-Alignment Problem (tags: shard theory, AI)
9/9/2022: Understanding and Avoiding Value Drift (tags: human values, shard theory, rationality, AI)
9/4/2022: The Shard Theory of Human Values (tags: understanding the world, shard theory, human values, rationality, AI)
8/8/2022: General Alignment Properties (tags: shard theory, AI)
7/25/2022: Reward Is Not the Optimization Target (tags: reinforcement learning, shard theory, AI)
7/14/2022: Humans Provide an Untapped Wealth of Evidence About Alignment (tags: shard theory, human values, AI)
7/7/2022: Human Values & Biases Are Inaccessible to the Genome (tags: understanding the world, shard theory, human values)