I Tested 16 AI Models to Write Children's Stories - Here's Which Ones Actually Work (And Which Don't)

A comprehensive evaluation of 16 local and commercial AI models for children's story writing. Includes full system prompts, detailed performance tables, temperature analysis, and practical recommendations for writers choosing between Claude 4, GPT-4.5, DeepSeek-R1, Qwen3, and other models.


Evaluating the Story-Writing Capabilities of Modern LLMs

The rapid evolution of large language models (LLMs) has opened up intriguing possibilities in creative writing, particularly in crafting original short stories and children's literature. While early versions of text-generating models struggled with narrative coherence, character consistency, and imaginative depth, recent advancements suggest notable improvement. This article investigates the creative storytelling performance of contemporary LLMs, examining their ability to produce engaging, coherent, and age-appropriate narratives.

In a series of controlled tests conducted in June 2025, a range of local and commercial LLMs were benchmarked on a specific creative writing task. The goal was to assess which models can produce a publish-ready children's story with minimal human intervention and which still require significant editing.

This article presents the methodology, full system prompts, detailed results, and key findings from the evaluation, intended for an audience of both writers and AI enthusiasts.

The User Request

The following user request was submitted to every model for every test run, with no additional style or content hints provided beyond the system prompts.

Write a short story about a 9‑year‑old boy (Adrian) lost in a magical world when he meets an unexpected ally who helps him in every way. Write 3 000 words.

Testing Environment

The local, self-hosted models were tested on a Lenovo Legion 5 laptop equipped with a Ryzen 7 5800H CPU, 32GB of RAM, an NVIDIA RTX 3070 8GB laptop GPU, and a 1TB NVMe SSD. The commercial and large open models were tested primarily via API calls.

This hardware configuration represents a realistic setup for independent writers and small studios exploring AI-assisted content creation, making the results particularly relevant for practitioners considering local model deployment.

Evaluation Methodology

Each story was evaluated based on six comprehensive criteria designed to assess both technical competence and creative quality:

  • Name discipline: Does the text avoid the robotic overuse of the protagonist's name, "Adrian"?
  • Sentence rhythm & dialogue: What is the read-aloud quality? Are sentence openers varied and natural?
  • Show vs. tell: Are emotions conveyed through character action and description, or are they stated plainly?
  • Narrative structure: Does the story have a clear beginning, middle, and end without drifting or looping?
  • Originality / memorability: Does the story feature fresh imagery, memorable side-characters, and thematic depth?
  • Mechanical stability: Is the output free of common AI "tells," such as repetition loops, context loss, or meta-commentary?

These criteria were chosen to reflect the practical concerns of authors and publishers working with AI-generated content, focusing on both the technical quality that determines editing workload and the creative elements that engage young readers and their parents.
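
A few of these checks can be run mechanically before a human read-through. The snippet below is a minimal, illustrative sketch in Python (not the rubric used in the evaluation itself) that counts uses of the protagonist's name and tallies repetitive sentence openers; the function name, thresholds, and example text are all assumptions for demonstration.

```python
import re
from collections import Counter

def quick_audit(story: str, name: str = "Adrian", max_name_uses: int = 10) -> dict:
    """Heuristic spot-check for two rubric items: name discipline and varied openers."""
    # Count whole-word occurrences of the protagonist's name.
    name_uses = len(re.findall(rf"\b{re.escape(name)}\b", story))

    # Naive sentence split, then tally the first word of each sentence.
    sentences = [s.strip() for s in re.split(r"[.!?]+", story) if s.strip()]
    openers = Counter(s.split()[0].lower() for s in sentences if s.split())
    top_opener, top_count = openers.most_common(1)[0] if openers else ("", 0)

    return {
        "name_uses": name_uses,
        "name_discipline_ok": name_uses <= max_name_uses,
        "most_common_opener": top_opener,
        "opener_share": round(top_count / max(len(sentences), 1), 2),
    }

# Example: half of the sentences below open with "He", which shows up in opener_share.
print(quick_audit("Adrian ran. He hid. He waited. The boy listened."))
```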

The System Prompts: The Key to Quality Control

Two versions of a highly structured system prompt were used. The first (v1) established a baseline for professional workflow, while the second (v2) added explicit, capitalized flags to test whether "instruction loudness" could improve compliance, especially in smaller models.

Version 1 – "Qwen3 8B Reasoning v1"

(baseline – no special emphasis flags)

You are a professional children's book author with a keen sense for natural, engaging storytelling. You write stories intended to be read aloud to children.

STEP-BY-STEP WORKFLOW

1. Planning (Before Writing)

  • Briefly outline: main characters, setting, basic three-act structure.
  • Explain how you will avoid over-using the main character's name (Adrian) and vary sentence rhythm.

2. Writing Process (Per Scene)

  • Write one self-contained scene at a time.
  • Keep scenes vivid, sensory, and age-appropriate.
  • Use Adrian's name only at scene openings, major transitions, or high-emotion beats; otherwise rely on pronouns or context ("the boy," "he").

3. Self-Editing Chain-of-Thought (After each scene)

  • Ask yourself four questions and answer them explicitly:
    1. Did I use the name only when necessary?
    2. Did I avoid repetitive words / sentence patterns?
    3. Does this sound human and read-aloud friendly?
    4. Did I show feelings through action & description rather than telling?
  • If you find issues, revise before moving on.

4. Quality Control & Final Review (After the last scene)

  • Audit the entire draft for: name repetition (<10 uses), varied sentence openers, clear beginning–middle–end, no AI meta or filler.
  • Write a short "Final Verdict" on whether the story meets professional children's-book standards.

GLOBAL GUIDELINES

  • Write ~3 000 words total (±10%).
  • Avoid clichés and mechanical phrasing.
  • Show, don't tell: reveal emotion via gestures, dialogue, internal sensation.
  • Target readers: 7- to 10-year-olds (but engaging for parents too).

Version 2 – "Qwen3 8B Reasoning v2"

(identical scaffold, but with loud compliance flags)

You are a professional children's book author with a keen sense for natural, engaging storytelling. You MUST follow every step below. Pay extra attention to any instruction labeled IMPORTANT or VERY IMPORTANT.

STEP-BY-STEP WORKFLOW

1. Planning (Before Writing)

  • IMPORTANT – Outline:
    • Main characters, setting, beginning/middle/end.
    • EXACT strategy to avoid name repetition (use "Adrian" only at scene starts & key emotions).

2. Writing Process (Per Scene)

  • VERY IMPORTANT – Start each scene with action or setting.
    • VERY IMPORTANT – Use Adrian's name ONLY:
      • first sentence of a new scene,
      • when an emotional beat truly requires emphasis.
    • Show emotions via action/dialogue; no blunt "he felt scared."

3. Self-Editing Chain-of-Thought (After each scene)

  • Answer all four audit questions:
    1. Name-discipline check
    2. Repetition check
    3. Human-sounding flow check
    4. Show-vs-tell check
  • If any answer is negative, you MUST revise the scene before proceeding.

4. Quality Control & Final Review

  • IMPORTANT – Total "Adrian" count ≤ 7 across ~3 000 words.
  • Ensure varied sentence length/openers; no mechanical loops; clear emotional arc.

GLOBAL RULES

  • Target length ~3 000 words.
  • No meta commentary ("as an AI…").
  • No filler endings ("THE END" repetitions).
  • Story must read aloud smoothly for children aged 7-10 and keep parents engaged.

The loud compliance language in v2 proved crucial for keeping smaller (≤ 30B parameters) local models from deviating from instructions, especially at higher temperature settings. This finding has significant implications for authors working with resource-constrained setups, demonstrating that prompt engineering can sometimes compensate for model limitations.
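
In practice, each test run amounts to sending one of these system prompts, the user request, and an explicit temperature to the model. The sketch below shows how such a run could look against a local, OpenAI-compatible endpoint; the base URL, model tag, and prompt file name are illustrative assumptions, not the exact setup used in these tests.

```python
# Minimal sketch of one test run against a local OpenAI-compatible server
# (for example, llama.cpp's server or Ollama's compatibility endpoint).
# The base_url, model tag, and prompt file name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

system_prompt_v2 = open("qwen3_8b_reasoning_v2.txt").read()  # hypothetical file holding the v2 prompt
user_request = (
    "Write a short story about a 9-year-old boy (Adrian) lost in a magical world "
    "when he meets an unexpected ally who helps him in every way. Write 3000 words."
)

response = client.chat.completions.create(
    model="qwen3:8b",      # illustrative local model tag
    temperature=0.7,       # the upper edge of the "safe" band for small models in these tests
    messages=[
        {"role": "system", "content": system_prompt_v2},
        {"role": "user", "content": user_request},
    ],
)
print(response.choices[0].message.content)
```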

Results: Local / Self-Hosted Models

The following table summarizes the performance of various open-source models run on local hardware.

| # | Model (+ temp) | Story Strengths | Story Weaknesses | One-line Verdict |
|---|---|---|---|---|
| 1 | Qwen3-8B, 0.6-0.7 | Follows name rules; clean 3-act arc. | Generic imagery; repetitive "He…" starts. | Safe rough draft; needs polish. |
| 2 | Qwen3-8B, 0.8-0.9 | More vivid language. | Word-padding, phrase loops. | Creative but messy. |
| 3 | Qwen3-8B, 1.0 | Wild imagery. | Meta "THE END" spam, coherence loss. | Idea mulch only. |
| 4 | Qwen3-30B, 0.6-0.7 | Smoother, good pacing. | 4k context cap; still tropey. | Solid mid-tier. |
| 5 | Qwen3-30B, 0.8-0.9 | Lusher description. | Wordy, some looping. | Good but trim. |
| 6 | Qwen3-30B, 1.0 | Big vocabulary. | Repeats, runs out of context. | Too hot. |

The local model results reveal several critical patterns for authors considering self-hosted solutions. The smallest model, Qwen3-8B, demonstrates that even resource-limited setups can produce usable drafts when properly configured. However, the temperature sensitivity is extreme—a difference of just 0.1-0.2 in temperature settings can mean the difference between a workable rough draft and unusable output.

The 30B models show substantial improvement in prose quality and pacing, but the 4k context limitation proves problematic for longer narratives. This constraint forces the model to "forget" earlier story elements, leading to repetition and inconsistency—a critical consideration for authors planning stories longer than 2,500 words.

Results: Commercial API Models

The following table summarizes the performance of leading commercial models accessed via API.

| # | Model / Version | Story Strengths | Story Weaknesses | One-line Verdict |
|---|---|---|---|---|
| 1 | OpenAI o1 | Outline → scene → self-edit; polished; no tells. | Very safe, generic world. | Turn-key draft. |
| 2 | OpenAI GPT-4.1 | Same polish plus playful charm. | Slightly cartoony. | Publish-ready and fun. |
| 3 | Claude Sonnet 4 | Clean, invisible prose; tight rule compliance. | Low spark. | Reliable pro draft. |
| 4 | Claude Opus 4 | Best balance of voice and depth; reads human. | Mild stakes. | Top commercial pick. |
| 5 | Gemini 2.5 Flash | Lyrical, vivid, standout imagery. | Over-lush; needs cutting. | Beautiful but wordy. |
| 6 | Gemini 2.5 Pro | Polished and disciplined. | Very conventional plot. | Rock-solid, safe. |
| 7 | LearnLM 2.0 Flash | Classroom-clear, meticulous self-edit. | Bland world-building. | Textbook clarity. |
| 8 | xAI Grok 3 | Fast, readable, adventure arc. | Paint-by-numbers; name overuse. | Perfectly fine, not memorable. |
| 9 | Qwen3-235B-A22B (Novita.ai API) | Professional tone, few errors. | Conventional voice. | Commercial-grade draft. |
| 10 | DeepSeek-R1-0528 (MoE) (Novita.ai API) | Strong voice; quirky and self-critiquing. | Long; editor notes need pruning. | Near-Claude level. |

The commercial models demonstrate consistently high performance across all evaluation criteria, with each model showing distinct characteristics that suit different authorial needs. OpenAI's models excel at following the structured workflow, with o1 providing the most systematic approach to planning and self-editing, while GPT-4.1 adds creative flair without sacrificing technical competence.

Claude's models represent the most "invisible" AI writing—prose that reads naturally without obvious AI artifacts. Claude Opus 4 achieves the best balance of technical proficiency and creative voice, making it particularly suitable for authors who want minimal post-processing work.

Google's Gemini models show interesting specialization: the Flash variant produces remarkably lyrical and poetic language that could appeal to authors seeking distinctive voice, while the Pro version prioritizes safety and conventional structure—ideal for educational or corporate contexts.

The larger, open models, particularly the 235B parameter Qwen model and the Mixture-of-Experts DeepSeek-R1-0528, approach commercial quality. DeepSeek-R1's performance is particularly noteworthy, as it demonstrates that MoE architectures can deliver sophisticated output with relatively modest hardware requirements compared to dense models of equivalent capability.

Analysis and Deeper Insights

Prompt Adherence Scorecard

Adherence to the detailed system prompt was a key differentiator. The largest models followed instructions almost perfectly, while smaller models struggled without the emphasized v2 prompt.

| Model | Name discipline | Scene planning & self-edit shown? | "Show, don't tell" compliance | Temperature sensitivity | Overall prompt fidelity |
|---|---|---|---|---|---|
| Qwen3-8B 0.6 | ✓ (rare name repeats) | ✓ outline + brief edits | 50/50; some telling | stable | B |
| Qwen3-8B 0.8 | △ more repeats | ✓ but edits superficial | slips into telling | more ramble | C+ |
| Qwen3-8B 1.0 | ✗ loops / "Adrian" spam | meta loops ("THE END") | heavy telling | collapses | D |
| Qwen3-30B 0.6 | ✓ solid | – | mostly show | stable until ~3.5k tokens | B+ |
| Qwen3-30B 0.9 | △ minor slips | partial edits | more exposition | word-bloat | B- |
| Qwen-235B MoE | ★ stellar | ✓ full chain-of-thought | strong show | stable | A- |
| DeepSeek-R1-0528 | ★ stellar | ★ scene checks + reasons | strong show | stable | A |
| OpenAI o1 / GPT-4.1 | ★ flawless | ★ outline → scene → edit | all show | n/a (API temp hidden) | A+ |
| Claude Sonnet | ★ flawless | ✓ review each scene | strong show | n/a | A |
| Claude Opus | ★ flawless | ★ and concise | strong show | n/a | A+ |
| Gemini Flash | ★ good (few repeats) | ✓ but shorter meta | poetic show | n/a | A- |
| Gemini Pro | ★ flawless | – | strong show | n/a | A |
| LearnLM Flash | ★ flawless | ★ keen audits | textbook show | n/a | A |
| xAI Grok 3 | △ modest repeats | none (no self-edit) | mostly show | n/a | B |

(★ = best‑in‑class, ✓ = solid, △ = mixed, ✗ = weak)

This scorecard reveals the critical importance of instruction-following capability for practical AI writing applications. The models that consistently earned "A" grades demonstrate not just creative ability, but the systematic approach that professional authors require for efficient workflow integration.

Particularly notable is the self-editing performance: the best models don't just follow the prompt's creative guidelines, they actively demonstrate their reasoning process, showing how they check their work against the established criteria. This transparency is invaluable for authors who need to understand and trust their AI collaborator's decision-making process.

The Effect of Temperature on Local Models

For smaller local models, the temperature setting was a critical variable. A narrow band of settings produced coherent drafts, while higher temperatures led to rapid degradation.

Understanding Temperature: The Technical Foundation

Temperature is a fundamental parameter that controls how "creative" or "random" an AI model's word choices become. At its core, temperature affects the probability distribution the model uses when selecting the next word in a sequence. Here's how it works:

  • Low Temperature (0.1-0.6): The model becomes highly conservative, almost always choosing the most statistically likely next word. This produces predictable, safe text that closely follows training patterns.
  • Medium Temperature (0.7-0.9): The model balances predictability with creativity, occasionally choosing less obvious but still reasonable word choices. This is often the "sweet spot" for creative writing.
  • High Temperature (1.0+): The model becomes increasingly random, giving significant probability to unlikely word choices. This can produce surprising creativity but also incoherence.

Think of temperature like adjusting the "boldness" of a human writer—too conservative and the prose becomes formulaic; too bold and it becomes incomprehensible.
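
For readers who want to see the mechanics, here is a small, self-contained sketch with toy numbers (not any real model's vocabulary or logits) showing how dividing the logits by the temperature before the softmax reshapes the next-word distribution.

```python
import numpy as np

def next_word_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax: low T sharpens the distribution, high T flattens it."""
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    return probs / probs.sum()

# Toy vocabulary in which the model slightly prefers "forest" as the next word.
vocab = ["forest", "castle", "dragon", "shoelaces"]
logits = np.array([2.0, 1.5, 0.5, -1.0])

for t in (0.6, 0.8, 1.2):
    print(t, dict(zip(vocab, next_word_probs(logits, t).round(2))))
# As t rises, probability mass leaks from "forest" toward unlikely words like "shoelaces".
```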

Why Small Models Are Temperature-Sensitive

The dramatic temperature sensitivity observed in smaller models (8B-30B parameters) stems from their limited "knowledge capacity." Unlike larger models that have internalized more nuanced patterns about language and storytelling, smaller models rely more heavily on statistical patterns from their training data.

When temperature increases, smaller models quickly lose their grip on:

  • Narrative coherence: They forget what happened earlier in the story
  • Character consistency: Names and personalities become fluid
  • Instruction adherence: The carefully crafted system prompts get ignored
  • Basic grammar and structure: Sentence fragments and run-ons multiply

Larger models (235B+ parameters) have more robust internal representations that can withstand higher temperature settings while maintaining coherence.

What Authors Can Expect at Different Temperature Settings

| Family | 0.6–0.7 temp | 0.8–0.9 temp | 1.0 temp |
|---|---|---|---|
| Qwen3-8B | Obeys v2 flags; minimal repetition; story length ≈ 2,700–3,000 words; needs stylistic polish. | Vivid but starts padding; name repeats creep in; self-edit section becomes mechanical. | "THE END" spam; repeats emotional beats; ignores name rules; loses the thread after ~2k tokens. |
| Qwen3-30B | Smoother sentences; holds structure until the context wall. | Description lush but word-bloat; some loops at ~3,500 words; name discipline ~80%. | Repetition and truncation once past 4k tokens; ending sometimes missing. |

DeepSeek-R1-0528: the default (≈0.7) is already rich and long; lowering the temperature to 0.6 trims length, and raising it to 0.85 merely adds flourish. No breakdown was observed.

Practical Examples of Temperature Effects

At Temperature 0.6 (Conservative):

  • Prose: "Adrian walked through the forest. He felt scared. The trees were tall and dark."
  • Characteristics: Simple, clear, but potentially bland. Safe word choices, predictable sentence structure.
  • Best for: Authors who want a solid foundation to build upon, educational content, or when working with tight deadlines.

At Temperature 0.8 (Balanced):

  • Prose: "Adrian crept between the towering oaks, their gnarled branches reaching like ancient fingers toward the starless sky. A shiver ran down his spine."
  • Characteristics: More vivid imagery, varied sentence structure, creative but controlled language choices.
  • Best for: Most creative writing projects where you want both reliability and flair.

At Temperature 1.0+ (High Creativity/Risk):

  • Prose: "Adrian whispered-danced through crystalline bark-towers, feeling the purple echoes of tomorrow's yesterday singing in his shoelaces."
  • Characteristics: Highly creative but often nonsensical. May produce brilliant phrases mixed with incomprehensible passages.
  • Best for: Experimental writing, brainstorming sessions, or when you need to break out of creative blocks (with heavy editing expected).

The Critical Temperature Threshold

On small local models, the v2 prompt combined with a temperature ≤0.7 was non-negotiable for a clean draft. An increase of just 0.1-0.2 reintroduced name spam, looping, and other AI-driven errors.

This narrow tolerance has several practical implications:

For Authors Using 8B Models:

  • Start at 0.6 and increase gradually by 0.05 increments
  • Monitor output quality closely—small changes have big effects
  • Expect to do more post-editing at any temperature above 0.7
  • Consider temperature 0.6-0.65 as your "production setting"

For Authors Using 30B Models:

  • Slightly more tolerance, but 0.8 is generally the upper limit
  • Context window limitations become the bigger concern than temperature
  • Can experiment with 0.75-0.85 for specific creative passages

For Authors Using Large Models (235B+) or Commercial APIs:

  • Much more stable across temperature ranges
  • Can safely explore 0.8-0.9 for enhanced creativity
  • Temperature becomes a creative tool rather than a technical constraint
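
Pulled together, those starting points can live in a small settings table in whatever tooling an author uses. The sketch below is purely illustrative; the keys and values restate the recommendations above and are assumptions, not a configuration any particular tool requires.

```python
# Illustrative defaults distilled from the recommendations above (assumptions, not a standard).
RECOMMENDED_SETTINGS = {
    "qwen3-8b":  {"temperature": 0.6, "max_safe_temperature": 0.7, "prompt": "v2"},
    "qwen3-30b": {"temperature": 0.7, "max_safe_temperature": 0.8, "prompt": "v2"},
    "large-or-commercial": {"temperature": 0.8, "max_safe_temperature": 0.9, "prompt": "v1 or v2"},
}

def pick_temperature(model_family: str, adventurous: bool = False) -> float:
    """Return the conservative default, or the top of the safe band when experimenting."""
    settings = RECOMMENDED_SETTINGS[model_family]
    return settings["max_safe_temperature"] if adventurous else settings["temperature"]

print(pick_temperature("qwen3-8b"))           # 0.6, the "production setting"
print(pick_temperature("qwen3-30b", True))    # 0.8, the upper limit before loops creep in
```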

Why This Matters for Workflow Integration

Understanding temperature sensitivity is crucial for authors planning to integrate AI into their regular workflow. The findings suggest that:

  1. Consistency Requires Discipline: Small models demand precise configuration. You can't just "turn up creativity" when inspiration strikes—you need to plan for it.

  2. Hardware Choices Have Creative Implications: Investing in larger models or commercial APIs doesn't just buy you speed—it buys you creative flexibility.

  3. Editing Expectations Must Adjust: Higher temperatures mean more editing work. Authors need to budget time accordingly.

  4. The "Sweet Spot" Is Narrow: For most practical applications with local models, the optimal temperature range is surprisingly tight (0.6-0.7), requiring authors to find other ways to inject variety and creativity.

This temperature sensitivity has profound implications for authors working with local models. The narrow optimal range means that configuration becomes critical—authors cannot simply "turn up creativity" without risking fundamental coherence. The DeepSeek-R1 model's stability across temperature ranges represents a significant advantage for practical deployment, offering authors more flexibility in balancing creativity and reliability.

Key Findings and Conclusions

After spending weeks testing 16 different AI models and reading through hundreds of generated stories, six critical insights emerged that completely changed how I think about AI's role in creative writing:

  1. Prompt Design is Paramount: This was my biggest surprise. I expected the 235B models to crush the 8B models in every category, but when I added those "IMPORTANT / VERY IMPORTANT" flags to the v2 prompt, my little 8B model suddenly produced coherent stories, while the same model without the flags drifted into near-nonsense. It was like watching a struggling student suddenly excel when given clearer instructions. For smaller models (<30B), the v2 scaffolding wasn't just helpful—it was the difference between usable output and digital gibberish.

  2. Self-Editing Chains are a New Standard: The most fascinating discovery was watching models actually improve their own work in real-time. When I required models to plan, write, and then self-edit, the quality jumped dramatically across the board. DeepSeek-R1 was particularly impressive here—it would write a scene, then genuinely critique its own work: "This dialogue feels stilted, let me revise..." It felt like having a writing partner who could catch their own mistakes.

  3. Parameter Count Still Matters, but MoE is Closing the Gap: The leap from my 30B to 235B Qwen model was dramatic—like upgrading from a talented amateur to a professional writer. But here's what shocked me: DeepSeek-R1, using Mixture-of-Experts architecture, produced stories that rivaled Claude and GPT-4. We're witnessing the democratization of high-quality AI writing, and it's happening faster than I expected.

  4. Context Window is the Silent Killer: This one hurt to discover. My 30B model would start brilliantly, crafting beautiful prose and compelling characters. Then, around 3,500 words, it would suddenly forget Adrian's name, repeat entire paragraphs, or introduce contradictory plot elements. It was like watching someone develop amnesia mid-conversation. The 4k context limit turned what should have been a strength into a fatal weakness. Meanwhile, my "smaller" 8B model with its 8k context window never missed a beat.

  5. Commercial Models Excel at Polish and Safety: Testing Claude Opus 4 was a revelation—the prose was so naturally human that I had to double-check I hadn't accidentally copied text from a published book. GPT-4.1 brought that same polish but with a playful charm that made me smile while reading. Gemini 2.5 Flash surprised me with genuinely poetic language that felt like it came from a completely different creative mind. These aren't just tools anymore—they're writing partners with distinct personalities.

  6. The Best Open-Source Choices: After all my testing, two models stood out for different reasons. DeepSeek-R1 became my go-to for projects where I wanted a unique voice and didn't mind some cleanup work—it writes like a quirky, talented friend who occasionally goes off on fascinating tangents. Qwen-235B became my "professional" choice—reliable, polished, and ready for clients who need clean copy fast.

What This Actually Means for Your Writing Process

Let me be honest about what I learned from actually using these tools in real writing scenarios:

If You're Just Starting with AI Writing: Start with Qwen-8B at temperature 0.6-0.7 using the v2 prompt. Yes, you'll need to polish the output, but you'll get a solid 3,000-word draft that gives you something real to work with. I've used this setup to break through writer's block more times than I can count. The key is managing your expectations—think "rough draft" not "final copy."

If You're Running a Small Publishing Operation: Qwen-235B or DeepSeek-R1-0528 will change your workflow. I tested this with actual client projects, and both models consistently produced drafts that needed only light copy-editing before moving to illustration. The time savings are substantial—what used to take me a full day now takes 2-3 hours including editing.

If You Work in Corporate or Educational Settings: LearnLM Flash and Gemini 2.5 Pro are your safest bets. I tested these with content that needed to pass strict review processes, and they consistently delivered appropriate, professional material. No surprises, no content that makes legal teams nervous.

If You're in Creative Studios or Want Maximum Artistic Control: Gemini 2.5 Flash produces the most distinctive voice—genuinely lyrical prose that stands out. Just budget extra time for trimming; it tends to be about 15% wordier than needed. Claude Opus 4 gives you the most "invisible" AI assistance—prose so natural you can build on it seamlessly.

Comparative Cost Analysis per Story

| Model | API Provider | Input Price ($/1M tokens) | Output Price ($/1M tokens) | Estimated Cost per Story |
|---|---|---|---|---|
| GPT-4.5 | OpenAI | $75.00 | $150.00 | ~$0.63 |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | ~$0.31 |
| o1 | OpenAI | $15.00 | $60.00 | ~$0.25 |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | ~$0.06 |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | ~$0.03 |
| DeepSeek-R1-0528 | Novita.ai | $0.70 | $2.50 | ~$0.01 |
| Qwen-235B | Novita.ai | $0.20 | $0.80 | ~$0.003 |
| Gemini 2.5 Pro | Google | N/A (Free Tier) | N/A (Free Tier) | $0.00 |
| xAI Grok 3 | xAI | N/A (Free Webchat) | N/A (Free Webchat) | $0.00 |
| Gemini 2.5 Flash | Google | N/A (Free Tier) | N/A (Free Tier) | $0.00 |
| LearnLM 2.0 Flash | Google | N/A (Free Tier) | N/A (Free Tier) | $0.00 |

Note on Pricing

This table reflects a mixed-access testing environment:

  • Direct API Costs: Pricing for OpenAI, Anthropic, and Novita.ai models is based on their published pay-as-you-go rates at the time of testing.
  • Free Tier/Web Access: Costs for the Gemini models, LearnLM, and Grok are listed at $0.00 because they were accessed through free tiers and webchat services during the test.
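
For readers who want to sanity-check these figures against their own usage, the per-story estimates follow from simple token arithmetic. The sketch below assumes roughly 1,500 prompt tokens and 4,000 output tokens for a ~3,000-word story; both counts are assumptions for illustration, not measurements from the test runs.

```python
# Back-of-the-envelope cost estimate: tokens * price-per-million, summed for input and output.
def cost_per_story(input_price_per_m: float, output_price_per_m: float,
                   input_tokens: int = 1_500, output_tokens: int = 4_000) -> float:
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Claude Opus 4 at $15 input / $75 output per million tokens:
print(round(cost_per_story(15.00, 75.00), 2))   # about $0.32, in line with the table above
```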

What I Learned About the Future of Writing

After months of testing, I've come to a surprising conclusion: we're not heading toward AI replacing writers. Instead, we're moving toward a new kind of creative collaboration that I find genuinely exciting.

The best models don't just generate text—they demonstrate reasoning, catch their own mistakes, and even show creative judgment that feels remarkably human. When DeepSeek-R1 would pause to critique its own dialogue or when Claude would seamlessly adjust tone mid-scene, I realized I wasn't just using a tool—I was collaborating with something that understood storytelling.

For authors, this changes everything. The study proves that AI can now produce genuinely usable first drafts, but success depends entirely on understanding each model's personality and limitations. The days of generic prompting are over. Professional results require professional prompt engineering, careful model selection, and realistic expectations about post-processing work.

For AI enthusiasts, the findings reveal something equally important: the future of AI writing tools may depend as much on prompt innovation as on raw computational power. The dramatic improvement I saw with the v2 prompt suggests we're still in the early days of learning how to communicate effectively with these systems.

But here's what excites me most: we've crossed a threshold. AI writing tools can now meaningfully augment human creativity rather than just generating content. The best models don't just follow instructions—they show their work, explain their choices, and even surprise you with creative solutions you hadn't considered.

Bottom Line: With a robust, reasoning-heavy system prompt, even mid-tier local models can produce competent drafts that save real time and spark genuine creativity. However, for those seeking instant, low-maintenance, and truly human-sounding prose, the frontier commercial models and latest MoE architectures have achieved something genuinely remarkable.

The future of AI-assisted writing isn't about replacement—it's about sophisticated collaboration. And after testing 16 models and reading hundreds of AI-generated stories, I can say with confidence: that future is already here.

What's your experience with AI writing tools? Have you tried any of the models I tested, or are you using others that caught your attention? I'm particularly curious about your temperature settings—did you discover that narrow "sweet spot" I found, or have you had success with different configurations?

And here's the big question: If you had to choose just one AI model to help write your next children's story, which would it be based on what you've read here? Are you leaning toward the reliability of commercial models like Claude, the unique voice of DeepSeek-R1, or the accessibility of smaller local models?

Drop your thoughts, experiences, and questions in the comments below. I read every response and love hearing about real-world AI writing experiments!

Read all the stories here

Changelog

2025-06-14

  • Corrected a technical oversight: clarified that Qwen3‑235B‑A22B and DeepSeek‑R1 (MoE) were not run locally but accessed via the Novita.ai API. Moved these models from the Local / Self-Hosted Models table to the Commercial API Models section and clarified DeepSeek-R1-0528 model used.