Homework 06

Author

Peter Ralph

Published

February 9, 2026

Parsing text

In this homework, I’d like you to parse some text. Specifically, please find the script to a movie on IMSDb, and write Python code that

  • downloads the HTML page with the script of the movie on it; and
  • parses the HTML and extracts the script, in a nice data-friendly form.
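A minimal sketch of these two steps, using only the standard library, might look like the following. It assumes that the script text sits inside a `<pre>` element on the page; that is an assumption about the page layout that you should check against the page you actually download. The names `download_html`, `PreTextExtractor`, and `extract_script_text` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def download_html(url: str) -> str:
    """Fetch a page and return its HTML as text (needs network access)."""
    # Some sites reject requests that have no User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <pre> elements we are currently inside
        self.pieces = []  # text fragments seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.pieces.append(data)


def extract_script_text(html: str) -> str:
    """Return the text of all <pre> elements (ASSUMED to hold the script)."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)
```

Keeping the download separate from the parsing means you can save the HTML to disk once and then develop the parser without hitting the website again. Tags nested inside the `<pre>` (for instance, bold character names) are dropped here, although they may carry useful information about which lines are names.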

The output should include, and distinguish between, both set directions and dialog, and should provide available context (e.g., who says which lines). Specifically, I’d like your script to output two things:

  • A text file, with one line per “chunk of text” in the script. The “chunks of text” would be the words spoken by a character, the set directions, or perhaps other annotations (e.g., settings; scene names).

  • A CSV file with one row (not counting the header) per line of the text file, which records, for the corresponding line of the text file, at least:

    1. what kind of line it is (e.g., “direction”, “dialog”, “setting”);
    2. who says the line, if it is dialog;
    3. other information (e.g., what the setting is, if it is describing the setting).
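One way to keep the two files in step is to write them in the same loop, one text line and one CSV row per record. The sketch below assumes each chunk has already been parsed into a dictionary with the keys `text`, `what`, `who`, and `info`; those keys, the function name `write_outputs`, and the file names are illustrative choices, not requirements.

```python
import csv
import tempfile
from pathlib import Path


def write_outputs(records, text_path, csv_path):
    """Write one text line and one CSV row per record, in the same order."""
    with open(text_path, "w", encoding="utf-8") as text_file, \
         open(csv_path, "w", encoding="utf-8", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["what", "who", "info"])  # header row
        for record in records:
            # Collapse any stray line breaks so each chunk stays on one line.
            text_file.write(" ".join(record["text"].split()) + "\n")
            writer.writerow([record["what"], record["who"], record["info"]])


# Hypothetical records, for illustration only.
records = [
    {"text": "Cher, two minutes.", "what": "dialog", "who": "MR HALL", "info": ""},
    {"text": "So, OK, like right now,\nfor example", "what": "dialog", "who": "CHER", "info": ""},
    {"text": "(Class breaks into applause)", "what": "direction", "who": "", "info": ""},
]
out_dir = Path(tempfile.mkdtemp())
text_path = out_dir / "script.txt"
csv_path = out_dir / "script.csv"
write_outputs(records, text_path, csv_path)
```

Because both files are written from the same list in the same loop, line *n* of the text file always corresponds to row *n* of the CSV (after the header).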

Note: all of the words spoken by a single character, uninterrupted, should be in one line, even though this may be spread across a paragraph with interspersed linebreaks in the script.
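For example, the wrapped lines of one speech could be joined like this (a small sketch; `collapse_whitespace` is a made-up name, and it assumes you have already gathered the lines belonging to one speech):

```python
def collapse_whitespace(lines):
    """Join the wrapped lines of one speech into a single line of text."""
    # str.split() with no arguments splits on any run of whitespace,
    # so indentation, line breaks, and doubled spaces all become one space.
    return " ".join(" ".join(lines).split())


speech = [
    "    Should all oppressed people be allowed refuge in America?",
    "    Amber will take the con position. Cher will be pro. Cher, two minutes.",
]
one_line = collapse_whitespace(speech)
```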

Please turn in the following:

  1. your Python script carrying out the above operations
  2. the resulting two files
  3. an ipynb file structured as a short report that reads in those files and does a little description: Who are the characters? How many lines do they each have? Perhaps a word cloud for each? Etcetera.
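As a starting point for the report, the number of dialog lines per character could be tallied along these lines. This sketch assumes the CSV has columns named `what` and `who`; here it reads from an in-memory string rather than a real file, and `count_dialog_lines` is a made-up name. In a notebook you might well prefer pandas for this.

```python
import csv
import io
from collections import Counter


def count_dialog_lines(csv_file):
    """Count rows labelled "dialog" for each speaker in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["who"] for row in reader if row["what"] == "dialog")


# Stand-in for open("script.csv", newline=""); the content is illustrative.
example_csv = io.StringIO(
    'what,who\n"dialog","MR HALL"\n"dialog","CHER"\n"direction",""\n'
)
counts = count_dialog_lines(example_csv)
for who, n in counts.most_common():
    print(f"{who}: {n}")
```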

For instance:

Consider the following chunk of script:


MR HALL



    Should all oppressed people be allowed refuge in America?
    Amber will take the con position. Cher will be pro. Cher, two minutes.



CHER 



    So, OK, like right now, for example, the Haitians
    need to come to America. But some people are all "What about the strain
    on our resources?" But it's like, when I had this garden party for
    my father's birthday right? I said R.S.V.P. because it was a sit-down dinner.
    But people came that like, did not R.S.V.P. so I was like, totally
    buggin'. I had to haul ass to the kitchen, redistribute the food, squish
    in extra place settings, but by the end of the day it was like, the more
    the merrier! And so, if the government could just get to the kitchen, rearrange
    some things, we could certainly party with the Haitians. And in conclusion,
    may I please remind you that it does not say R.S.V.P. on the Statue of
    Liberty?


(Class breaks into applause)
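A chunk laid out like this one could be split into labelled pieces with a few simple rules, sketched below. The rules are assumptions that fit this example and will likely need adjusting for a real script: blocks are separated by blank lines, a block in all capitals is a character name, a block in parentheses is a direction, and the block right after a character name is that character's dialog. The name `parse_chunks` and the record keys are made up for illustration.

```python
import re


def parse_chunks(script_text):
    """Split plain script text into records with 'what', 'who' and 'text'."""
    records = []
    speaker = None  # set when the previous block was a character name
    # Blocks are ASSUMED to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", script_text.strip()):
        text = " ".join(block.split())  # collapse wrapped lines to one line
        if not text:
            continue
        if text.startswith("(") and text.endswith(")"):
            records.append({"what": "direction", "who": "", "text": text})
            speaker = None
        elif text.isupper():
            speaker = text  # e.g. "MR HALL"; the next block is their speech
        elif speaker is not None:
            records.append({"what": "dialog", "who": speaker, "text": text})
            speaker = None
        else:
            records.append({"what": "direction", "who": "", "text": text})
    return records


sample = (
    "MR HALL\n\n\n"
    "    Should all oppressed people be allowed refuge in America?\n"
    "    Amber will take the con position.\n\n\n"
    "CHER \n\n\n"
    "    So, OK, like right now,\n"
    "    for example.\n\n"
    "(Class breaks into applause)\n"
)
records = parse_chunks(sample)
```

These heuristics are fragile: a short line of dialog written entirely in capitals would be mistaken for a character name, for instance, so it is worth checking the output against the original script.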

For this chunk, the output could look something like the following. First, the text file would have three lines:

Should all oppressed people be allowed refuge in America? Amber will take the con position. Cher will be pro. Cher, two minutes.
So, OK, like right now, for example, the Haitians need to come to America. But some people are all "What about the strain on our resources?" But it's like, when I had this garden party for my father's birthday right? I said R.S.V.P. because it was a sit-down dinner. But people came that like, did not R.S.V.P. so I was like, totally buggin'. I had to haul ass to the kitchen, redistribute the food, squish in extra place settings, but by the end of the day it was like, the more the merrier! And so, if the government could just get to the kitchen, rearrange some things, we could certainly party with the Haitians. And in conclusion, may I please remind you that it does not say R.S.V.P. on the Statue of Liberty?
(Class breaks into applause)

And the CSV file should have three lines plus a header; perhaps:

what,who
"dialog","MR HALL"
"dialog","CHER"
"direction",""