Learner//Meets//Future: AI-Enabled Assessments Challenge
Frankenstories & Writelike
Provide a one-line summary of your solution.
Using LLMs to evaluate short-form student writing across genres.
What type of organization is your solution team?
For-profit, including B-Corp or similar models
What is the name of the organization that is affiliated with your solution?
Writelike
Film your elevator pitch.
What is your solution?
We want to use foundation LLMs to provide continuous formative feedback on short-form collaborative writing, but it's still harder than it sounds.
Context
We work on a couple of resources that help middle-school students learn higher-order writing skills, the kind they use when writing entertaining stories, detailed analyses, and compelling arguments.
Writelike
Writelike is a mentor-text modeling resource that began in 2013 with a Gates Foundation LitChallenge grant. It teaches narrative, argumentative, persuasive, and informational writing through short snippets of mentor text with strategic instruction and worked examples.
Writelike is effective and has some long-term fans, but it's always been limited by an inability to provide immediate automated feedback. Over the years, we have experimented with NLP services and pre-GPT-3.5 LLMs, but we could not find anything that gave accurate, meaningful, and consistent feedback on the type of writing tasks that Writelike gives to students: short-form responses with highly variable, more-semantic-than-grammatical criteria for 'success'.
In 2021, we decided to focus on real-time peer feedback, which led to Frankenstories.
Frankenstories
Frankenstories is a fast-paced, multiplayer, online creative writing game suitable for Grades 4-12. It's designed to make writing so fun that students don’t want to stop.
Players can play in groups of any size: individually, in pairs, in small groups, or as a whole class.
Everyone gets the same prompt, then players write in timed rounds.
When the round timer runs out, players read anonymised submissions and vote for their favorite.
The winner gets stitched into the spine of the story/argument, and the process repeats.
This format creates an engaging, social, hivemind-like writing experience that is inviting to even the most reluctant writers. In this format, learning is as much about voting and collaborating as it is about writing.
Teachers can channel that enthusiasm into structured learning using our library of skill-based curriculum prompts or by creating custom games.
And, while narrative is often the entry point for Frankenstories, it can be used for any kind of writing (analytical, informational, persuasive, poetic) and in any subject: ELA, social studies, history, science, foreign languages, etc.
The AI part
So we have two sites where students generate a lot of short-form text with context-dependent criteria, and now teachers want insight into student learning over time.
As a baseline, we want to classify student responses against grade level (traffic light style) on dimensions such as content, coherence, clarity, and collaboration and provide teachers with a longitudinal heatmap that they can use to direct their work.
This is harder than it sounds.
We've been evaluating foundation LLMs using real student data and few-shot prompting. The results are still inconsistent but promising enough to warrant more work. We hope that a combination of model improvements and in-context examples will be enough to produce reliable results.
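To make that concrete, here is a rough sketch of the kind of few-shot classification call we mean. Everything in it (the rubric wording, the example responses, the function and variable names) is illustrative rather than our production code, and it assumes the OpenAI Python SDK:

```python
# Illustrative only: a few-shot, rubric-based traffic-light classifier.
from openai import OpenAI  # assumes the official openai Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Dimension: content. Grade 5 expectations:
- red: little or no relevant detail, or off-task
- yellow: some relevant detail, but thin or disconnected
- green: several specific, relevant details that build on the prompt"""

# A handful of teacher-labelled responses used as in-context examples.
EXAMPLES = [
    ("The dog barked.", "red"),
    ("The dog barked at the mailman because he smelled like cats.", "yellow"),
    ("The dog barked at the mailman, who froze on the porch, one hand still "
     "holding the parcel he'd been about to hide in the hedge.", "green"),
]

def classify_response(student_text: str, model: str = "gpt-4-turbo") -> str:
    """Return 'red', 'yellow', or 'green' for one student response."""
    messages = [{
        "role": "system",
        "content": ("You are grading short student writing against a rubric. "
                    "Answer with exactly one word: red, yellow, or green.\n\n" + RUBRIC),
    }]
    for text, label in EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": student_text})
    result = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return result.choices[0].message.content.strip().lower()
```

In a live deployment, the rubric and examples would change per grade, dimension, and task; getting that to behave consistently is exactly the part that is still unreliable.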
If we could get classification working reliably and receive positive feedback from teachers, we would consider exploring generative feedback.
How will your solution impact the lives of priority Pre-K-8 learners and their educators?
The broad goal is to increase the proportion of Grade 4-8 students who are at grade level or above in writing tests. That also means increasing grade-level performance in writing-heavy subjects, e.g. ELA, history, social studies, & science.
This is also about developing pro-school attitudes, so we would hope to see increases in school attendance and completion, and a reduction in behavioral incidents.
To understand how we get there, it's important to understand that this pitch is about using LLMs to enhance Writelike & Frankenstories—so our logic model is really about the underlying logic of those tools.
Scenario
Imagine you're teaching Grade 5 in a Title I school where learners are already several grades behind in basic literacy. One challenge (among many) is that the learners don't engage in reading and writing tasks; they find it difficult and unrewarding and there's limited support at home.
You begin playing Frankenstories. At first, the kids write Skibidi memes to get laughs from their peers, and they all vote for ridiculous responses. Yes, you have them writing, but the quality is not great. The LLM can recognise low-effort responses, so your first report is a column of red.
Depending on the class, you can go in a couple of directions. Let's say the majority of learners struggle to write even a single idea. In that case, you might focus on teaching students how to systematically observe, collect, and value fragmentary details rather than worry about coherent prose.
From a data perspective, you pay attention to length and content because you want students to capture lots of details, and you pay attention to voting quality because you want students to recognise and vote for detailed responses.
Imagine the heatmaps are now a mix of red, yellow, and green. You split the class into differentiation groups and focus on instructing the reds and yellows while the greens consolidate. You're teaching writing skills, plus underlying creative and collaboration skills. You have the students writing in a context they think is fun, but you're getting them to approach it with a level of craft discipline.
You then start to work up the ladder of complexity, getting students to write more coherent sentences, then paragraphs, and targeting more sophisticated skills (from collecting details to modeling cause and effect to making meaning). You use the heatmaps to get a sense of how the group is progressing and who needs what support—and provide some evidence to admin that your approach is working.
That is a baseline Frankenstories scenario. As students build motivation for writing, you might start to incorporate Writelike lessons and use the AI to help evaluate whether students can apply specific skills.
Note that the LLM data is teacher-facing and formative. We'd want to make sure it was completely reliable before showing it to students and we have no plans to provide summative evaluation.
How are you and your team (if you have one) well-positioned to deliver this solution?
Unique combo
Our co-founders combine three critical disciplines: public school teaching, writing, and digital educational product design (including game design).
School relationships
Because we are not in a classroom day to day, we talk to teachers using Frankenstories on a weekly basis and respond to feedback with new features, content, or product refinements. (In the last 18 months we have received written or verbal feedback about Frankenstories from more than 150 teachers across four countries, including many from remote and underprivileged communities, Title I schools, etc.)
As an example of being guided by the community, we had an existential crisis when ChatGPT came out and were very tempted to dive immediately into developing the kind of feedback system we've described here. We showed teachers a range of prototypes plus other non-AI features, and we were surprised to find that everyone wanted us to invest effort in the non-AI stuff (e.g. in Frankenstories that meant round-based scaffolding, preset skill prompts that link to Writelike, group management features, moderation tools).
The teachers' reasoning behind this was that Frankenstories created a unique, inherently social and collaborative environment that made students want to write, and everyone wanted features that enhanced that experience first and foremost.
Now that most of those requests have been fulfilled, we are getting more requests for reporting and evaluation tools (because teachers are spending more time on the platform and want to understand their own data, get a sense of student progress, and make a case to their admin/districts for purchasing).
On the AI front
We're not ML specialists, but we have two platforms that generate unique student writing data.
We have conducted enough tests of NLP and LLMs over the last 10 years to have a sense of what we're looking for and enough failed tests to be cautious.
Which dimension(s) of the challenge does your solution most closely address?
Which types of learners (and their educators) is your solution targeted to address?
What is your solution’s stage of development?
Concept
In what city, town, or region is your solution team headquartered?
Brisbane QLD, Australia
Is your solution currently active (i.e. being piloted, reaching learners or educators) within the US?
No, but we have plans to be
Who is the Team Lead for your solution?
Andrew Duval
What makes your solution innovative?
The thing we try to innovate on is providing tools that help students learn the complex genre patterns essential to more advanced writing. (Less about spelling, grammar, 3-act structure, or the 5-paragraph essay, and more about contrast, emotion, cause and effect, reasoning, classification, and so on.)
Writelike
When we started working on Writelike, 10 years ago, it was innovative because most writing instruction focused on strategic instruction rather than short-form mentor-text modelling. It's less innovative now that mentor-text modelling is much more common, though it still has an innovative approach to curriculum design and social feedback.
Frankenstories
Frankenstories is innovative because it turns writing into a team sport. It's not just that students are writing together, but that the writing process is broken into rounds with pressure timers and peer voting, and you can have teams 'competing' on versions of the same prompt.
This formula is new and it produces a creative, collaborative hivemind effect through speed and compression.
The effect is so striking that we kind of expect the Frankenstories format to become a standard part of the way that middle schoolers are taught to write, whether through Frankenstories itself or as a generic approach in other implementations.
Frankenstories links to Writelike's curriculum to give teachers skill pathways through different genres such as narrative, argument and persuasive writing, in any subject.
Its closest analog is Theatresports, but for writing instead of the stage.
Big-picture, part of the innovation is that we give students a really compelling reason to want to write, which means generating rich and varied data that would not otherwise exist.
The opportunity with LLMs
One of the big challenges with both tools described above is that it has historically been impossible to use automated assessment, because students are writing short-form responses with varied criteria for 'success' (often semantic rather than grammatical).
However, the new LLMs are pretty perceptive and can infer a lot from context supplied with the prompt, which means they can now be used as assessment tools for variable short-form tasks.
The design challenge is figuring out exactly how to deploy them because they also have risks and limitations (e.g. inconsistent reasoning, bland responses, a tendency to over-correct, rigidity from RLHF—and each model has idiosyncratic strengths and weaknesses).
We have the advantage (or disadvantage, TBD) of an existing context in which to use a model and calibrate its deployment.
This means we are able to begin with a light-touch, classification-oriented, teacher-facing deployment and then iterate from there.
Effects
Our best outcome would be if we finally cracked the problem of reliably teaching the component skills of complex writing (and not just letting students outsource their thinking to the models).
We also believe we can make a contribution by trialling different approaches to evaluating such a unique formative data set.
Whether or not we succeed, we will be an interesting demonstration of an embedded, low-key, 'AI second' deployment of a model when so much attention is being given to generative 'AI first' approaches.
Describe the core AI and other technology that powers your solution.
The AI component is third-party LLMs; models can be swapped as needed based on ongoing evaluations.
Everything else is standard web stack: Azure, .NET, React.
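One way to make that swapping cheap is to keep scoring behind a thin, provider-agnostic interface so that changing models is a configuration change rather than a rewrite. A minimal sketch of that idea, with hypothetical names, in Python for brevity (the production backend is .NET):

```python
# Hypothetical sketch: keep the scoring model swappable behind one small interface.
from typing import Protocol

class Scorer(Protocol):
    def score(self, rubric: str, response: str) -> str:
        """Return a traffic-light label: 'red', 'yellow', or 'green'."""
        ...

class OpenAIScorer:
    def __init__(self, model: str):
        from openai import OpenAI
        self.client, self.model = OpenAI(), model

    def score(self, rubric: str, response: str) -> str:
        result = self.client.chat.completions.create(
            model=self.model,
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Answer with one word: red, yellow, or green.\n" + rubric},
                {"role": "user", "content": response},
            ],
        )
        return result.choices[0].message.content.strip().lower()

# Swapping to a different provider or model is then a configuration change:
scorer: Scorer = OpenAIScorer(model="gpt-4-turbo")
```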
How do you know that this technology works?
There's no simple answer to this question.
Right now, we can't get GPT-4 Turbo (which tends to score best on our evaluations) to consistently and reliably do what we want (e.g. apply a state-based rubric to the kind of text that students produce in Frankenstories) and generalise across contexts.
We tend to get it working for one grade, but then the approach won't generalise to another, or for one rubric but not another. Too often we see complete wildcard results, where the model seemingly disregards our criteria and imposes its own logic (e.g. classifying a Grade 2 response as Grade 5 because it displayed 'conflict' when we're asking it to look for features like run-on sentences).
We know that model capabilities are progressing, and GPT-3.5 crossed a clear threshold of viability. (There were all sorts of tests we ran that always failed before GPT-3.5 and then suddenly passed out of the box, zero-shot, e.g. distinguishing between a semicolon functioning as a proxy 'so' vs 'because' in a sentence.)
However, it is still not consistent enough to say it will work in all, or even most, situations.
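When we say 'consistent', we mean something we can measure: re-running the same teacher-labelled responses through the classifier and checking both agreement with the teacher and run-to-run stability, per grade and per rubric. A hypothetical sketch of that kind of check (the names and data shapes are assumptions, not our actual test harness):

```python
# Hypothetical reliability check for one grade/rubric combination.
# `classify_response` is the kind of few-shot classifier sketched earlier.
from collections import Counter

def evaluate(labelled_responses, classify_response, runs: int = 3):
    """labelled_responses: list of (student_text, teacher_label) pairs."""
    agreement, stability = 0, 0
    for text, teacher_label in labelled_responses:
        predictions = [classify_response(text) for _ in range(runs)]
        majority_label, count = Counter(predictions).most_common(1)[0]
        agreement += (majority_label == teacher_label)  # matches the teacher's label
        stability += (count == runs)                    # all runs agree with each other
    n = len(labelled_responses)
    return {"agreement_rate": agreement / n, "stability_rate": stability / n}
```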
We are currently working on the expectation that reliability can be resolved through prompt engineering, e.g. being more precise with the standards in rubrics and providing more effective examples.
We also expect model capability improvements to solve some problems 'passively'.
However, language models can have surprising failure cases and our work involves a lot of subtlety, so we also have the option of releasing capability in layers—this isn't an all-or-nothing approach.
Even without LLMs, releasing a dashboard that uses word count as a proxy indicator for quality (including quality of voting/collaboration) will be a step forward for teachers.
We can then release AI assessment in layers on top of that, based on reliability—i.e. we can limit it to certain rubrics, grades, and/or dimensions.
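In practice, that layering could be as simple as a whitelist of rubric/grade/dimension combinations that have passed our reliability checks, with the word-count proxy as the fallback everywhere else. A hypothetical sketch (the combinations and thresholds below are placeholders):

```python
# Hypothetical gate for layered release: LLM scoring only where it has proven reliable,
# a crude word-count proxy everywhere else.
RELIABLE = {
    ("frankenstories-custom-rubric", "grade-5", "content"),
    ("frankenstories-custom-rubric", "grade-5", "clarity"),
}

def score(rubric: str, grade: str, dimension: str, response: str, llm_score) -> dict:
    if (rubric, grade, dimension) in RELIABLE:
        return {"source": "llm", "value": llm_score(rubric, grade, dimension, response)}
    words = len(response.split())
    value = "red" if words < 10 else "yellow" if words < 40 else "green"  # illustrative thresholds
    return {"source": "word-count", "value": value}
```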
There's only so much evaluation we can do internally. At some point, we have to release components to teachers and get granular real-world feedback.
Fortunately, all our writing tasks are formative and low-stakes, and AI feedback will be teacher-facing only. That means we have a little wiggle-room for experimentation and feedback if we deploy something live at scale (so long as the AI feedback isn't completely useless).
At the end of the day, we feel like something should be possible but we still don't know for sure. It's certainly more tricky than we would have expected given current LLM capabilities, but we expect there to be at least some amount of value that can be unlocked, even if it's initially limited.
Our evaluation data is pretty convoluted because there are a lot of permutations, but to give you a sense of what we do when we test models, here is a sample based on a single 3rd-grade class.
What is your approach to ensuring equity and combating bias in your implementation of AI?
We think the first step is just being aware of the issue and minimizing the surface area for problems.
That's part of the reason we want to begin with such a narrow implementation: classification not generation, formative not summative, teacher-facing not student-facing.
All of these decisions limit the potential harms of the AI.
Even within these bounds, we know that the LLMs tend to steer towards a generically white, middle-class, middle-aged, urban liberal form of language and values, which can lead to misleading evaluations of student writing. (Not just for Black and Latino learners or those experiencing poverty; our sense is that the models are biased against the anarchic and extreme inclinations of children in general.)
To deal with that, we can perform granular evaluation across a wide range of data sources: Grades 3-12, across different countries, communities, and demographics. (We've been doing this manually, but we can scale it up with a live implementation.)
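One way to make that evaluation granular is to compute agreement between model labels and teacher labels separately for each cohort (grade, country, community) and flag cohorts where the model underperforms the overall rate. A hypothetical sketch of that slicing (the cohort tags and data shape here are placeholders):

```python
# Hypothetical bias check: per-cohort agreement between model and teacher labels.
from collections import defaultdict

def agreement_by_cohort(records):
    """records: iterable of (cohort_tag, teacher_label, model_label) tuples."""
    totals, matches = defaultdict(int), defaultdict(int)
    for cohort, teacher_label, model_label in records:
        totals[cohort] += 1
        matches[cohort] += (teacher_label == model_label)
    return {cohort: matches[cohort] / totals[cohort] for cohort in totals}

# Cohorts whose agreement sits well below the overall rate get flagged for manual review.
```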
However, it's not entirely clear what we would do about any problems we find from that process.
It would be nice to say that we can tune prompt instructions and examples—but there's no guarantee that will fix anything and/or not break something else.
It would also be nice to say we could give teachers some control via the prompt, specifically the examples and standards sent to the model. That way teachers can set their own standards using the language of their class. But again, there is no guarantee that will work on a case-by-case basis, and we don't want to turn teachers into model evaluators; it's frustrating enough when it's your actual job!
There's a possibility of fine-tuning foundation models using student data, but there are questions around the feasibility of annotating training data consistently and accurately against criteria which could be widely open to interpretation.
At the end of the day, we just don't know, so our baseline is to minimize risk by limiting capabilities and trying to be granular and mechanical about how we're assessing any given dimension. We'll need to iterate from there.
As part of that process, we expect to be transparent and collegial with teachers, meaning we're not interested in presenting the veneer of a "magical AI". Rather, we want to show teachers that we are all using what is essentially a public utility with the temperament of a hyperdimensional cat, so teachers need to treat results as provisional and possibly either customize their inputs or edit the outputs.
How many people work on your solution team?
4 full-time
2 part-time
How long have you been working on your solution?
2.5 years
If your solution has a website or an app, provide the links here:
https://frankenstories.writelike.org/
What is your plan for being pilot ready (if not already) within the next year, and what evidence can you provide that you are on track to meet your goals?
We're likely to begin development of a reporting tool (to be deployed in Frankenstories and Writelike) in July, barring any urgent user requests or feedback.
The first release will assess using word count, not AI.
After that, probably in August, we will move to an LLM integration. We're most likely to launch with one custom rubric tailored to the strengths and limitations of the LLM (as opposed to a U.S. or Australian rubric). We would hope to assess content and clarity from Grades 3-12, but that is TBC.
Technically, it's straightforward web development with minimal technical risk, though there are challenges related to UX, information density, and the degree of user control.
The real risk is the reliability of the foundation model, as discussed. We will continue evaluating model outputs and testing our approach.
We expect to have some form of AI-enabled assessment in testing with existing users by September/October 2024.
This will include some teachers of priority learners by your definition.
What are your plans to ensure your solution is available, accessible, and affordable to priority learners at scale?
Currently:
- We have a freemium SaaS model within a standard company structure.
- We're a small team that can operate quite efficiently. We don't need a huge amount of revenue.
- We want pricing to be a no-brainer and affordable for individual teachers so they don't have to try to convince their admin to pay, though ideally schools/districts will purchase for their teachers.
- Teachers can play casual Frankenstories games for free. These games can have any number of students and the teacher can customize the prompt text and image to an extent.
- We have a Pro subscription, which is US$60 per teacher per year and includes unlimited classes and students. This subscription provides more customization, content, and class management features.
With the AI feedback:
- AI feedback would only be available to paid subscribers.
- We might need to consider a pricing tier that accounts for volume of API calls.
- The pricing depends on token costs, which are going down. Currently, we think API calls for an average user might cost between $5 and $15 per year for classification in Frankenstories alone (ignoring Writelike and any additional AI features).
- We could look into serving our own instance of an open-source model.
- However, we would likely need to add something like $15-30 per teacher or $2-5 per student per year.
We would happily offer the service for free to teachers if we could find institutional subscribers.
Why are you applying to the Learner//Meets//Future Challenge?
We're most interested in feedback and evaluation opportunities, whether through informal networks or structured programs.
We're also interested in learning what others are doing and sense-checking our approach/assumptions/beliefs. We might be able to integrate other people's tools.
Grant funding to support development is always good.
In which of the following areas do you most need partners or support?
Solution Team
- Andrew Duval, Co-Founder, Writelike
What is the name of your solution?
Frankenstories & Writelike