In the midst of our random data exploration, Laure and I started playing around with Hadley’s movies dataset and noticed that there were a lot of old cartoon animations… I mean REALLY old. So we got excited and wondered if we could find YouTube links for all these old animations. Indeed we could! Here’s how we did it. As always, the full analysis is on GitHub.
- In R, load in ggplot2movies::movies, which has something like 60,000 movies.
- Reduce this dataset to animations of only 10 minutes or less, arrange by year and descending rating, then select only the 1 best animation per year. Hrm… actually, select each year’s 3 best-rated cartoons. You’ll see why later. This reduced dataset has <300 rows… much more manageable. Of this, only keep the title and year; you don’t need the rest.
- Now let’s create a column that will give us good search results… in order to do that, prepend “Cartoon”, wrap the title in quotes, and put the year between parentheses so that each row looks like this: "Cartoon \"It’s the Cat\" (2004)" (the backslashes escape the quotes inside the string… dontworriboutem).
- Now feed each one of these lines into bing.com (google/yahoo don’t let us!) and capture the results using httr::GET()
- Now we have the whole webpage with the results inside. We use XPath to try our best to grab only the parts of the webpage that we want, namely the search results, and then find the first YouTube hit among them. It takes a LOT of cleaning to figure out what you need and what needs to be thrown out. From this, grab the title of each search result page and the link (by the way, at this point, we have a shorter list because not every movie has a YouTube link).
- Now we have to face the reality that the search result may have given us a YouTube link that wasn’t the film we wanted. In order to understand whether we did or not, we use a package called stringdist. This measures the difference between the title we were looking for and the title of the YouTube hit we got. Sometimes you look for “Unsteady Chough, The” and get back “The unsteady chorough”. For this analysis, I found the qgram method to perform best. Unfortunately, sometimes even the string distance alone isn’t a good predictor. One or two misspellings in a title with 3 letters is a bigger deal than 5 misspellings in a title with 30 letters. Therefore:
- Come up with a Percent parameter where you divide the stringdist() result by the number of characters in the title you were looking for. This will give you a good indication of how much error there is relative to the length of the string.
- Based on an evaluation of the string distances and percents, we realized that very few cartoons have good correspondence between the title looked for and the title found. This is why one movie per year doesn’t work and we had to go back and select the 3 best cartoons.
- After analyzing the results carefully, we decided to cut the dataset to only stringdist() < 8 and Percent < 100 (although results vary a lot here). This left about a hundred cartoons.
- So now that you have good cartoons and their Youtube links, put some html code in front and in back, and use cat() to push it all out into an html file!
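If you want to try the data-prep steps yourself, here is a rough sketch of how they could look with dplyr. It is not the exact code from our analysis: the helper name top_short_cartoons() and its arguments are made up for illustration, and it assumes a data frame with the same columns as ggplot2movies::movies (title, year, length, rating and a 0/1 Animation flag).

```r
library(dplyr)

# Hypothetical helper: expects a data frame shaped like ggplot2movies::movies,
# i.e. with columns title, year, length, rating and a 0/1 Animation flag.
top_short_cartoons <- function(movies, per_year = 3, max_length = 10) {
  movies %>%
    # keep only animations of max_length minutes or less
    filter(Animation == 1, length <= max_length) %>%
    # pick the best-rated cartoons within each year
    group_by(year) %>%
    slice_max(rating, n = per_year, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(year, desc(rating)) %>%
    # only title and year are needed from here on
    select(title, year) %>%
    # build the search string, e.g. Cartoon "It's the Cat" (2004)
    mutate(search_term = paste0("Cartoon \"", title, "\" (", year, ")"))
}
```

Something like top_short_cartoons(ggplot2movies::movies) should then give you one row per candidate cartoon, with a search_term column ready to send to the search engine.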
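The search-and-scrape step could be sketched roughly like this with httr and xml2. Big caveat: the XPath below assumes that Bing wraps each organic result in an li element with class b_algo and puts the link in an h2 inside it. That is an assumption about Bing's markup, it may well have changed, and it is not necessarily the expression we ended up with. The function names search_bing() and first_youtube_hit() are also just illustrative.

```r
library(httr)
library(xml2)

# Fetch the raw HTML of a Bing results page for one search term.
search_bing <- function(term) {
  resp <- GET("https://www.bing.com/search", query = list(q = term))
  stop_for_status(resp)
  content(resp, as = "text", encoding = "UTF-8")
}

# Pull the first YouTube result (title + link) out of a results page.
# ASSUMPTION: results live in <li class="b_algo"> ... <h2><a href=...>.
first_youtube_hit <- function(html_text) {
  doc <- read_html(html_text)
  links <- xml_find_all(doc, "//li[contains(@class, 'b_algo')]//h2/a")
  hrefs <- xml_attr(links, "href")
  is_youtube <- !is.na(hrefs) & grepl("youtube.com/watch", hrefs, fixed = TRUE)
  if (!any(is_youtube)) return(NULL)  # not every movie has a YouTube link
  first <- which(is_youtube)[1]
  data.frame(
    found_title = xml_text(links[[first]]),
    link = hrefs[first],
    stringsAsFactors = FALSE
  )
}
```

Keeping the download and the parsing in separate functions means you can save the pages once and then fiddle with the XPath as much as you like without hitting the search engine again.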
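Here is a small sketch of the matching step, using stringdist() with method = "qgram" from the stringdist package. The helper score_matches() is hypothetical, and multiplying the ratio by 100 is our guess at how to get a "Percent" that makes sense with a cutoff of 100. The default thresholds are the ones mentioned above.

```r
library(stringdist)

# Hypothetical helper: compare the titles we searched for with the titles
# we got back, and flag the pairs that look close enough to keep.
score_matches <- function(wanted, found, max_dist = 8, max_perc = 100) {
  # q-gram distance (default q = 1) only compares character counts,
  # so word order like "X, The" vs "The X" matters much less
  d <- stringdist(wanted, found, method = "qgram")
  # ASSUMPTION: Percent = 100 * distance / length of the wanted title
  perc <- 100 * d / nchar(wanted)
  data.frame(
    wanted = wanted,
    found = found,
    dist = d,
    perc = perc,
    keep = d < max_dist & perc < max_perc,
    stringsAsFactors = FALSE
  )
}
```

For example, score_matches("Unsteady Chough, The", "The unsteady chorough") returns one row with the distance, the percent, and whether the pair survives the cut. You will probably want to play with the two thresholds, since results vary a lot.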
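And the last step, pushing everything out with cat(), might look something like this. It is a bare-bones sketch: the function name write_cartoon_page() and the column names year, title and link are assumptions, and it does not escape special characters such as & in titles.

```r
# Hypothetical helper: write a minimal HTML page with one link per cartoon.
# Expects a data frame with columns year, title and link.
write_cartoon_page <- function(cartoons, file) {
  # one list item per cartoon: "year, dash, linked title"
  items <- sprintf(
    "<li>%s &ndash; <a href=\"%s\">%s</a></li>",
    cartoons$year, cartoons$link, cartoons$title
  )
  page <- c("<html><body><ul>", items, "</ul></body></html>")
  # html in front, html in back, one line each, pushed out with cat()
  cat(paste0(page, "\n"), sep = "", file = file)
}
```

Calling write_cartoon_page(cartoons, "cartoons.html") on the filtered data frame then gives you a page you can open in a browser and click through.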
Final thoughts
- We tried to find the images for each video to entice viewers to watch the cartoon, but this was reeeeeally hard to do! All the search engines we found push the images as data or in iframes so we couldn’t capture the image with GET. We finally gave up 🙁
- We could have probably found more hits if we had looked at more movie sites than YouTube, but we didn’t like the idea of going to whatever site… who knows what ads they might have pop up or whatever.
- It’s lots of fun to see how cartoons have changed through the ages! It’s amazing to see what they could already do in 1922 and how quickly cartoons improved over time! Click around and see for yourself!
- Some of the early attempts were full of errors, but in a way it feels a bit psychedelic to click on these links… it’s Youtube roulette! I dare you! Click around »here«.
RESULTS HERE BELOW! Click to go to the youtube link!
1906 – Humorous Phases of Funny Faces
1911 – Little Nemo
1916 – R.F.D. 10,000 B.C.
1916 – Krazy Kat, Bugologist
1919 – Feline Follies
1922 – Puss in Boots
1925 – Alice’s Egg Plant
1928 – Vormittagsspuk
1929 – Springtime
1931 – Bimbo’s Initiation
1932 – Flowers and Trees
1933 – Une nuit sur le mont chauve
1933 – Three Little Pigs
1934 – Tale of the Vienna Woods
1934 – China Shop, The
1935 – Three Orphan Kittens
1937 – Lonesome Ghosts
1938 – Porky in Wackyland
1939 – Peace on Earth
1939 – Blue Danube, The
1941 – Dance of the Weed
1941 – Fox and the Grapes, The
1942 – Horton Hatches the Egg
1943 – Fighting Tools
1943 – Red Hot Riding Hood
1947 – Tubby the Tuba
1948 – Cat That Hated People, The
1948 – Mouse Wreckers
1950 – Morris the Midget Moose
1950 – Ventriloquist Cat
1951 – Dude Duck
1951 – Symphony in Slang
1952 – Rock-a-Bye Bear
1955 – You and Your Senses of Smell and Taste
1959 – Short and Suite
1960 – High Note
1962 – Self Defense… for Cowards
1962 – Now Hear This
1965 – Go Go Amigo
1967 – Bear That Wasn’t, The
1968 – Windy Day
1969 – Corbeau et le renard, Le
1969 – Bambi Meets Godzilla
1970 – Is It Always Right to Be Right?
1971 – Thank You Mask Man
1972 – Balablok
1978 – Afterlife
1978 – Special Delivery
1979 – Harpya
1979 – Canada Vignettes: Log Driver’s Waltz
1979 – Every Child
1985 – Broken Down Film
1985 – Concerto Grosso Modo
1986 – Luxo Jr.
1988 – Cat Came Back, The
1995 – Life of Larry, The
1995 – Chicken from Outer Space, The
1999 – Pinocchio