Can machine learning assess the readability of texts?
In May, we kicked off another Project Friday. This time, a group of seven Cmotions colleagues, of different origins and levels, came together. As always, the aim was to make a brilliant deliverable, to learn a lot, and (most of all) to have a lot of fun.
The first challenge was to come up with a topic, which would lead to a cool product that everyone is excited about. After brainstorming for a while, it was soon decided to enter the CommonLit Readability Prize (a Kaggle competition) focusing on the following question:
βTo what extent can machine learning identify the appropriate reading level of a passage of text, and help inspire learning?β
This question arose as CommonLit, Inc. and Georgia State University wanted to offer texts of the right level of challenge to 3rd to 12th grade students, in order to stimulate the natural development of their reading skills. Until now, the readability of (English) texts, has been based mainly on expert assessments, or on well-known formulas, such as the Flesch-Kincaid Grade Level, which often lack construct and theoretical validity. Other, often commercial, solutions failed to meet the proof and transparency requirements.
From May to August β21, we worked on developing a Python notebook that is able to assess readability scores better than current standards. We worked with a training dataset of about 3.000 text excerpts that were rated by 27 professionals. In these months, we investigated which features affect readability the most, which models and architecture are best to use, and how to tweak them. The name of our team? Textfacts!
Certainly, you want to know if we succeeded in predicting text readability better than existing solutions, right? Stay tuned and follow our blogs for our stories! Do you have any questions or do you want to know more about this cool project right now? Please click here to visit the competition webpage or contact us, via info@theanalyticslab.nl.