Posts tagged "NewsBeastLabs"
Today we published a data story looking at how iOS devices fail to accurately correct some words such as “abortion” and “rape.” Here’s a detailed methodology on how we did that analysis.
It started back in January when we were working on our project mapping access to abortion clinics. The reporters on the project, Allison Yarrow and I (Michael Keller), were emailing a lot about the project, which led to us typing the word “abortion” into our phones on a fairly regular basis. We noticed that iOS never autocorrected this word when we misspelled it, and when we would double-tap the word to get spelling suggestions, the correctly spelled word was never an option. We decided to look further into whether this could be repeated on iPhones with factory settings and what other words iOS doesn’t accurately correct. To do this, we set out to build a complete list of words that iOS software doesn’t accurately correct.
We did this in two stages:

Stage One: Use the iOS API’s built-in spell-checker to test a list of misspelled words programmatically.


Step 1: Get a list of all the words in the English language
We combined two dictionaries for this: the built-in Mac OS X dictionary that can be found in /usr/share/dict and the WordNet corpus, a widely used corpus of linguistic information, which we accessed through NLTK, a natural language processing library for Python. We left out words shorter than three characters, entries in the corpus that were actually two words (e.g. “adrenal gland”), and words with punctuation such as dashes or periods (e.g. “after-shave”, “a.d.”). We reasoned that these words were either too short to accurately correct or had more variables to them than we would be able to test on an even playing field, so we left them out of our analysis.
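A minimal sketch of that step in Python might look like the following (this assumes NLTK and its WordNet corpus are installed; it is a reconstruction, not our exact script):

    import re
    from nltk.corpus import wordnet  # requires nltk.download("wordnet")

    words = set()

    # The dictionary that ships with Mac OS X
    with open("/usr/share/dict/words") as f:
        words.update(line.strip() for line in f)

    # Every lemma name in the WordNet corpus (multi-word entries use underscores)
    for synset in wordnet.all_synsets():
        for lemma in synset.lemmas():
            words.add(lemma.name())

    # Keep single words of three or more letters with no punctuation
    words = {w for w in words if len(w) >= 3 and re.match(r"^[A-Za-z]+$", w)}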

Step 2: Create misspellings of these words
We wanted to test slightly misspelled versions of every word in the English language, so, to start, we wrote a script that produced three misspellings of each one: one where the last character was replaced with the character to its left on the keyboard, one where the last character was replaced with the character to its right, and a third where the last character was replaced with a “q”. Because modern spellcheck systems know about keyboard layout, these adjacent-character misspellings should be the low-hanging fruit of corrections.
For instance, “gopher” would become “gophet,” “gophee,” and “gopheq”.
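Here is a rough sketch of that generator in Python, assuming a standard QWERTY layout (again, an illustration rather than the script we actually ran):

    # Map each key to its left and right neighbors on a QWERTY keyboard
    ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    NEIGHBORS = {}
    for row in ROWS:
        for i, ch in enumerate(row):
            left = row[i - 1] if i > 0 else ch
            right = row[i + 1] if i < len(row) - 1 else ch
            NEIGHBORS[ch] = (left, right)

    def misspellings(word):
        """Return the left-adjacency, right-adjacency, and 'q' misspellings of the last character."""
        stem, last = word[:-1], word[-1].lower()
        left, right = NEIGHBORS.get(last, (last, last))
        return [stem + left, stem + right, stem + "q"]

    print(misspellings("gopher"))  # ['gophee', 'gophet', 'gopheq']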

Step 3: Run these misspelled words through an iOS API spellchecker program.
Apple doesn’t have a “spellcheck program,” but for iOS developers it has an API with a function that will take in a misspelled word and return a list of suggested words in the order of how likely it thinks each suggestion is. In Xcode, the program you use to write iPhone and iPad apps, you can use a function under the UITextChecker class called “guessesForWordRange,” which will do just that. Before testing each word, however, we ran the non-misspelled word through a function in this class called “rangeOfMisspelledWordInString,” which will tell you whether the word in question exists in the iOS dictionary. This meant that we weeded out words that were in our WordNet and Mac dictionary lists but that iOS wasn’t aware of. In other words, we only tested words that, if you spelled them correctly on an iOS device, wouldn’t get the red underline. For all of our tests we used the then-most up-to-date version of Xcode, 4.6.2, and ran the most up-to-date version of the iOS 6 Simulator.
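In Objective-C, the heart of that check looks roughly like this (the variable names are illustrative and error handling is omitted):

    UITextChecker *checker = [[UITextChecker alloc] init];

    // Only test words iOS itself knows: a correctly spelled word should come
    // back with no misspelled range.
    NSRange misspelledRange = [checker rangeOfMisspelledWordInString:correctWord
                                                               range:NSMakeRange(0, correctWord.length)
                                                          startingAt:0
                                                                wrap:NO
                                                            language:@"en_US"];
    BOOL knownToiOS = (misspelledRange.location == NSNotFound);

    // Ask for guesses for the deliberately misspelled version and check whether
    // the original word shows up anywhere in the suggestions.
    NSArray *guesses = [checker guessesForWordRange:NSMakeRange(0, misspelledWord.length)
                                           inString:misspelledWord
                                           language:@"en_US"];
    BOOL corrected = [guesses containsObject:correctWord];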
We also checked each misspelled word against the dictionary to make sure it wasn’t itself a real word. For example, “tab” has a right-adjacency misspelling of “tan,” which is also a word. In that case, the script fell back to the “q” misspelling. So if it was testing “tan” as a misspelling for “tab,” it would see that “tan” is a real word and throw “taq” at the spellchecker instead. Obviously, “taq” is a harder misspelling of “tab” to correct, but we also gave it “tav,” its left-adjacency misspelling. If the spellchecker got either of these right, we counted “tab” as a word that it can accurately correct. Later on, as our list got smaller, we tried many more misspelling combinations to be sure we gave the spellchecker many chances to correct what should be easy corrections.

Step 4: Analyze results
If a word was accurately corrected at least once, we marked it as properly recognized by iOS. This process narrowed our list down from about 250,000 to roughly 20,000 words. There was one big problem though: the iOS spellcheck didn’t accurately correct some words that real iPhones were able to correct. For instance, the API wouldn’t correct “aruguls” to “arugula,” for some reason. Our questions to Apple on this went unanswered; if anyone has any suggestion as to why the two systems are different, please let us know.
After meeting with some New York-area iOS developer meetup groups, we found that the spellcheck on the iOS simulator as a part of Xcode does correct these edge cases, which led us to stage two.

Stage Two: Use spellcheck on the iOS simulator to check the remaining 20,000 words
To access the word suggestions on the iOS simulator, you need one crucial piece of hardware: a human hand. It was easy enough to write an iOS program that presents a word in the simulator, but there’s no way to programmatically pull up the spellcheck suggestion menu because iOS programs don’t have scope for system-level operations. To do that, you need to physically double-click the word and navigate through the various menus. 

Step 1: Find a way to automate clicking
To solve this, we got into our wayback machine and wrote an AppleScript that would move the mouse to specific coordinates on the screen, wait a specified number of milliseconds for menus to appear and then click in the appropriate places. Our iOS program had a button that, when clicked, saved the original word, the presented misspelled word, and the final result of the correction. Our AppleScript script clicked through the menus, replaced the word if the simulator presented a suggestion, then clicked the button to serve the next word. 
We tried to make this process as fast as possible, but it ended up taking around 1.6 seconds per word. 1.6 multiplied by 20,000 is 32,000 seconds, or roughly 8.8 hours. But we also wanted to present even more misspelling options—twelve more in total.
We can call this Step 2, create more misspellings:
1. Double last character.
2. Double last character with a capitalized first character.
3. Missing last character.
4. Missing last character with a capitalized first character.
5. Misspelled first character (via left misspelling adjacency) and capitalized first character.
6. Misspelled first character (via left misspelling adjacency).
7. Misspelled first character (via right misspelling adjacency) and capitalized first character.
8. Misspelled first character (via right misspelling adjacency).
9. Misspelled second character (via left misspelling adjacency) and capitalized first character.
10. Misspelled second character (via left misspelling adjacency).
11. Misspelled second character (via right misspelling adjacency) and capitalized first character.
12. Misspelled second character (via right misspelling adjacency).

So, including our first two lists of misspelled last characters (the left and right adjacencies), we had 14 lists of 20,000 words to run through. 14 multiplied by 8.8 hours is 123.2 hours, or about five days if the program ran straight for 24 hours a day. We needed to take a break in between each of the 14 sessions, however, and restart Xcode just in case there was a learning algorithm—we didn’t want the results of one session to pollute another.
Renting computers from Amazon is easy, but not if they’re Mac OS computers, which aren’t available through Amazon and get rather expensive through other dealers. Fortunately, the Columbia School of Journalism let us take over one of their Mac computer labs, so we were able to run the script in parallel and finish in a much more reasonable time frame. It also meant my laptop wasn’t out of commission crunching words for a week. Here’s a Vine of what the automated corrections looked like: 

One drawback of this method was that we could only get the mouse-automation script to select the first suggestion. So, in the scenario that for the misspelled word “abortiom”, “aborted” was suggested as more likely than “abortion,” this program would mark that as an inaccurate correction. We weren’t too worried about this, though, because 1) our iOS script in stage one *did* take into account multiple suggestions, so all the words had two chances to be corrected in that scenario, and 2) we presented 14 different misspellings of these words, and if any one of these variations was corrected to the right word, we counted that word as accurately corrected. If a word that is only off by one character isn’t suggested that many times, then something in the algorithm isn’t handling that word correctly.

Step 3: Analyze results
This second stage only cut out around 6,000 words, leaving us with 14,000 words that were never accurately corrected. The related article lays out our findings, but our initial hypothesis held true: “abortion” is a word that iOS doesn’t correct, unlike Android phones. Apple declined to comment for this project, so we have many unanswered questions. One idea for future research is whether iOS devices are incapable of learning certain words like “abortion.” That is to say, these words may be blocked not just at the dictionary suggestion level, but at the machine learning level as well.

Stage Zero:  Find the files.
Before we did Stage One we had a different strategy: find this list of seemingly banned words somewhere in the iOS file structure. To do this, we put out a call on Facebook for any friends who would donate an old iPhone to be jailbroken. We got three phones: one from my mom, and two from some very nice old friends who mailed them to our offices. We factory-reset and jailbroke one and kept the others factory-fresh for testing. We went searching and found some promising files in the LinguisticData directory called “pos,” “ner,” and “lemmas,” which, in the natural language processing world, stand for “part of speech,” “named entity recognition,” and “lemmatization,” the analysis of word stems and inflected forms, like “better” being associated with “good” as its base. These files were unreadable, however, because they weren’t in any known format. The only way we could read them was in their raw binary-hex form, which looks like that terrible mess of characters you see when you open a corrupted Word document—like Wingdings but with less rhyme or reason.
After many attempts at deciphering where a list of blocked words could reside, and after reaching out to the New York iOS community, we started in earnest on reverse-engineering the list ourselves with Stage One.

The other week, we published our latest interactive: ‘Male Plumage’ Then and Now: The Changing Face of Men’s Fashion. Now, we don’t consider ourselves fashionistas, but one morning Michael got an email asking him to upload a Newsweek article to DocumentCloud. It happened to be an article from 1968 about men’s plumage, apparently from around the time men first started to define a distinct modern American style via patterns and accessories. We looked at the article and started laughing at all the nostalgic memorabilia and vintage photographs.
The Newsweek cover containing the article depicts a Ron Burgundy-like character in a pink suit, with some of the fashion items arranged around him as cut-outs for a paper doll. In 1968 this depiction of changing outfits was for illustrative purposes, but in 2013 how could we re-imagine this cover? Could we see the cover come to life by having the pieces of clothing actually be changeable by the reader? It was fun to think about adding another dimension to print, and to look at something from the past and apply it to something current.
At the same time, Isabel Wilkinson, one of Newsweek’s fashion writers, was doing a story on Men’s Fashion Week 2013 in Europe. We thought it would be a great idea to compare men’s plumage in 1968 to the plumage we see today as part of the fashion show. And so we had a project.
Under the hood
To begin, we imagined the cover pieces as draggable items, with the 2013 plumage items stored in a drawer to the side. It was a fairly simple layout, allowing the cover to stand on its own and then be transformed by the reader into the moveable pieces. Instead of hardcoding the individual elements, we used The Miso Project from GitHub to create JSON from a Google Spreadsheet via an API key, which gave us the flexibility to add or remove pieces as we were designing it. We used Underscore.js as our templating engine to make our HTML elements. More about Miso is explained in this previous post, but for this project, under a tight news schedule, filling out a spreadsheet was the easiest way to get the JSON we needed for our template. In our JSON, each object has an id, an image to load into the template, a layer or z-index value, and a classification of either “then” (1968) or “now” (2013):
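Something like the following, for example (the field names here are illustrative, not necessarily the exact column headers we used):

    [
      { "id": "ascot",      "image": "img/ascot.png",      "layer": 3, "category": "then" },
      { "id": "skinny-tie", "image": "img/skinny-tie.png", "layer": 5, "category": "now"  }
    ]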



In our HTML, we added a script tag to identify our template and classes containing values from our JSON, something like:
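For example (the id, class names, and fields here are hypothetical stand-ins for our real markup):

    <script type="text/template" id="plm-item-template">
      <div class="plm-item plm-item-<%= category %>" data-id="<%= id %>" style="z-index: <%= layer %>;">
        <img src="<%= image %>" />
      </div>
    </script>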

In our JavaScript, we grabbed the template markup by its id and turned it into an Underscore template function with _.template(). We then appended each row from the spreadsheet (now JSON) to either a div containing the items of “then” or a div containing the items of “now”: ‘#plm-canvas-then’ or ‘#plm-canvas-now’. The syntax looks something like this (slightly different than our deadline code, but slightly nicer as well):
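Roughly, assuming the Miso dataset has already fetched the spreadsheet rows (the variable and field names are illustrative):

    // Turn the template markup into an Underscore template function
    var template = _.template($('#plm-item-template').html());

    // Append each row to the "then" or "now" container
    dataset.each(function (row) {
      var target = row.category === 'then' ? '#plm-canvas-then' : '#plm-canvas-now';
      $(target).append(template(row));
    });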

To make the elements draggable, we used jQuery UI like this for all of the item elements, since they all had the same class:
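Something like the following (the class and option values are illustrative):

    $('.plm-item').draggable({
      containment: '#plm-stage',  // keep items inside the interactive
      stack: '.plm-item'          // bring the dragged item to the front
    });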

Drag events don’t translate too well on mobile, however. To fix that, we loaded the jQuery UI Touch plug-in, which translates touch events into the appropriate mouse handlers for mobile screens.
Now we have all of our elements draggable and organized based on whether they are on the cover or in the “drawer.” The drawer element is something Michael came up with to allow the reader to pull and drag items into one place. Further, once readers make a plumage combination they like, they can share their design via social media. I’ll pass it on to Michael now to explain the development of both of these features.
The Drawer and Shareable Links
Thanks, Clarisa. This was a fun interactive to work on, and it had a few tricky details—this drawer was one of them. As Clarisa said, we wanted to mash up the old style with the new. We also wanted an option to remove items from the canvas once you were done with them so you could clean up your design. In addition to .draggable(), jQuery UI has .droppable(), which lets you designate elements that draggable items can be dropped onto.
This was a little tricky because we wanted elements to be absolutely positioned outside the drawer but relatively positioned inside the drawer so they would stack up on top of each other. We handled this through swapping classes on .mousedown() and applying different positioning for each class.
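A sketch of that swap, assuming CSS classes that set position: absolute on the canvas and position: relative in the drawer (the class and id names are illustrative):

    // Items picked up go back to absolute positioning on the canvas
    $('.plm-item').on('mousedown', function () {
      $(this).removeClass('in-drawer').addClass('on-canvas');
    });

    // Items dropped on the drawer switch to relative positioning so they stack
    $('#plm-drawer').droppable({
      drop: function (event, ui) {
        ui.draggable.removeClass('on-canvas').addClass('in-drawer');
      }
    });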
To make it all shareable, we used jQuery BBQ, which is super handy. On item drag, we recorded the final x-y position of each element and whether it was in the drawer or on the canvas. We used jQuery.bbq.pushState() with a merge mode of 0 to accomplish this. On load, we checked to see whether someone had a saved state in the hash and, if so, drew the plumage to reflect that.
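A simplified version of that save-and-restore logic (the hash keys and ids are illustrative):

    // On drag stop, merge this item's position and location into the hash
    $('.plm-item').on('dragstop', function () {
      var $el = $(this), state = {};
      var place = $el.closest('#plm-drawer').length ? 'drawer' : 'canvas';
      state[$el.data('id')] = $el.position().left + ',' + $el.position().top + ',' + place;
      $.bbq.pushState(state, 0);
    });

    // On load, redraw any items that have a saved position in the hash
    $.each($.bbq.getState(), function (id, saved) {
      var p = saved.split(',');
      $('[data-id="' + id + '"]').css({ left: +p[0], top: +p[1] });
    });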
Cues to interactivity

One small detail is the rotation that happens when you start the interactive. We needed to rotate the items because 1) they wouldn’t fit on the model in their original positioning; and 2) we wanted to cue the reader that the hitherto static image was now alive. This was also the logic behind spinning the model’s pink suit: we needed a way to tell the reader that the image was now in the 21st century.
 
Clarisa Diaz and Michael Keller
 


The other month, we asked readers why they did or did not own guns. Here’s the post on the creation of that. We ended up getting over 1,500 responses, and our readers were largely thoughtful in what they wrote. Before we launched this project, we didn’t know how our audience would break down into gun owners and non-gun owners, and whether we would get the usual talking points from each side or hear new points of view. We fortunately did see a lot of interesting stories and patterns. But uncovering those trends in over a thousand responses in only a few days was a challenge. Reading through every response on deadline was unfeasible, so we had a machine do it. 
The idea of using natural language processing to make sense of a large number of reader comments came from Blair Hickman, social media producer at ProPublica, and we ended up using the Overview Project, a natural language processing tool developed by the Associated Press (screenshotted above). You can read more about the technical details of how it works here, but generally it looks for clusters of related words and groups them together in the tree layout you see. 
One interesting trend that surfaced (read the analysis with selected comments) was a group of comments about how growing up with guns influenced people both to own and not to own guns later in life. Many people framed their non-ownership of guns around their lack of a need for one, and the algorithm found this cluster around phrases with “need” in them. Viewed in this light, many gun-owner responses can be seen as implicitly responding to this prompt, explaining their clear, pragmatic need for gun ownership, such as hunting, protection in rural areas, or many of the other reasons we included. Overview also brought out repeated comments that had variations on the phrase “When seconds count, police are minutes away,” reflecting a similar need for self-protection or a desire for self-reliance, depending on how you interpret it.
A number of readers discussed how their association with the military sometimes led them to be comfortable with firearms or, in many cases, the opposite. As one person put it: “As an army officer I came to realize many guns are tools created with a main purpose of killing. I don’t hunt and I don’t target shoot. No need for a gun.”
The clustering algorithm also grouped comments about how experiences with family members committing suicide led people not to own guns, as well as comments about religion. The posts concerning religion were especially interesting since, in a 2008 campaign speech, President Obama said that some people “cling to guns or religion” as a way to “explain their frustrations” with society. All of the comments submitted to us that discussed religion were from staunch non-gun owners, however. 
I think the post is worth a read if you’re interested in seeing an attempt at surfacing a conversation. On the design side, you can see we tried to get out of the way of the readers’ voices as much as possible: we wrote a short introductory sentence framing each collection of comments and then let them tell their stories. Our front-end dev wiz Lynn Maharas actually pushed through that blue-background quote style — which didn’t exist on our site before — especially for this, so that we could string a bunch of quotes together and keep it readable.
Using Overview was pretty easy, especially since it now has a web interface that you can upload your documents to. You have to remove any commas first, however. We augmented some clusters with simple keyword searches for thematically related words that the algorithm might not be able to connect the dots between. For instance, if “military” was one cluster, we might also look at “Navy” and “Army” as related words. This might not actually give any extra insight, since the algorithm is looking for clusters anyway and might pick up on those, but it can help guide a more analog approach once the machine has broken things down into potentially interesting categories. 
We clustered the spreadsheets of responses independently of one another, but one further analysis would be to see if the machine would divide them automatically into two camps. There’s always more analysis to be done than deadlines allow. Here’s the anonymized data, though; let us know at NewsBeastLabs@gmail.com if you find anything interesting.
-Michael


UPDATE: FEB 10 @RepsGunTweets has been changed to @YourRepsOnGuns. Check out www.ThisIsYourRepOnGuns.com for the ongoing project.

Brian Abelson is a data scientist who is graciously donating his time at NewsBeast Labs before he starts a full-time position as a Knight-Mozilla Open News Fellow at the New York Times in February.
For an upcoming project on the gun debate, we’ve been monitoring statements representatives have made on the topic. As President Obama prepared to unveil his proposal for gun control on Wednesday, Michael and I were curious to see the reactions of representatives to the highly publicized announcement and to be able to report them in real time. Given the degree to which breaking news is now reported (and responded to) on social media, we thought it would be useful to build a bot that logs officials’ comments on certain issues and presents them in real time. Such a tool could be used by newsrooms to engage their readers on a continuous basis by aggregating and serving content from members of particular communities or committees.
@RepsGunTweets was born.
We were inspired by the work of the 2013 Mozilla-Knight OpenNews fellows, who recently built a prototype for an app called “if (this) then news,” a news-oriented take on IFTTT, a site for linking triggers from Gmail, Twitter, Dropbox, and other services to actions on the web. Applying this logic to news coverage, the fellows created the shell for a tool that would monitor live data streams, detect important events, and issue notifications. As Vice President Biden took the mic, we started furiously coding up a bot that would follow the Twitter accounts of U.S. representatives and retweet any comment that included “gun,” “assault weapon,” “firearm,” or other relevant keywords. After a couple hours of missteps and headaches, we eventually got @RepsGunTweets up and running. In the last ten days, the bot has logged 307 tweets, two-thirds of which came in the first three days. We’re still analyzing the conversation, but one interesting observation is that representatives who are not in favor of gun control tend to link to longer explanations of their position on their websites instead of tweeting a comment.
Under the hood
At its core, a retweet bot is a pretty simple tool: follow a feed, find what matters, and serve it back up under a single account. The harder part is figuring out how to accurately communicate with Twitter’s API. Using Tweepy for Python, we were able to easily access Twitter’s numerous methods. All we needed to provide were the consumer key, consumer secret, access token, and access token secret for an application generated on http://dev.twitter.com/apps. The bot follows CSPAN’s members-of-Congress list, applies a regular expression for the desired keywords, and retweets any matches. For even more technical info, check out this GitHub page.
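A simplified sketch of the bot using Tweepy’s streaming interface (the keys are placeholders, the list slug and keyword pattern are illustrative, and newer Tweepy versions have since renamed some of these classes; the full code is on the GitHub page linked above):

    import re
    import tweepy

    CONSUMER_KEY = "..."
    CONSUMER_SECRET = "..."
    ACCESS_TOKEN = "..."
    ACCESS_TOKEN_SECRET = "..."

    KEYWORDS = re.compile(r"\b(guns?|assault weapons?|firearms?)\b", re.IGNORECASE)

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Collect the user ids on C-SPAN's members-of-Congress Twitter list
    members = tweepy.Cursor(api.list_members,
                            owner_screen_name="cspan",
                            slug="members-of-congress").items()
    member_ids = set(user.id for user in members)

    class GunTweetListener(tweepy.StreamListener):
        def on_status(self, status):
            # Retweet any tweet from a member that matches the keywords
            if status.user.id in member_ids and KEYWORDS.search(status.text):
                api.retweet(status.id)

    stream = tweepy.Stream(auth, GunTweetListener())
    stream.filter(follow=[str(uid) for uid in member_ids])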



Six Months in Review

NewsBeast Labs is roughly six months old and we’ve had a lot of fun. This Tumblr has most of our projects from the past few months, but there are a bunch from before our launch. Here’s a rough list of the projects we’ve done so far.

Legal Experts Decode the Supreme Court’s Obamacare Ruling - Our very first project! We launched it the day we got DocumentCloud, which was also the morning of the Supreme Court ruling on Obamacare. We asked two law professors to make margin notes in the text of the ruling as they were reading it for the first time. Readers could follow along and read experts’ reactions as the conversation was happening.

Digital 100: Who’s Following Whom? - A network visualization of how Newsweek’s list of influential people in the digital space interact with each other on Twitter. 

Obamacare: It’s Cheaper! - I like to call these “Story Visualizations” - visual presentations of stories that could run as a list or as text, but are much more interesting visually. Matt DeLuca and I did a side-by-side on how Obamacare would affect different age groups’ healthcare spending.

2012 Olympics: The Latest Medal Tally - We had a live-updating Olympic Medal Count (with a snazzy sortable table that I’ve written about before) that I worked on with our awesome intern Sarah Hedgecock. We also did a version of it for our right rail (sidebar).

Interactive Map: London’s Olympic Transformation - The Olympic Park rose from the rust. Sarah and I also did a satellite view before and after interactive that included a bunch of info on the star-chitect buildings.

Interactive Map: The U.S. Shooting Epidemic - Following the Aurora shooting, Brian Abelson and I made an interactive map of multiple-victim shootings since 2005 and asked readers to respond with their memories. We published a selection of the reader responses here. The full spreadsheet list is here.

As Income Inequality Widens, Rich Presidential Candidates Dominate - Lauren Streib and I worked on a chart (she did all the numbers), showing presidential income over the years. I remember this one chart taking four hours from start to finish for some reason…

Big Guns Inside the National Rifle Association Leadership - Who’s leading the NRA? I worked on a project on the NRA’s leadership with three colleagues, Caitlin Dickson, Eliza Shapiro, and Kevin Fallon. They dug through 990 forms and put together small profiles of the people at the top. We put it together in a mosaic-style presentation. Normally this type of story would be a gallery format, but since it’s not picture-based, we decided to create something more conducive to reading a lot of text.

SuperPAC App Election Ad Interactive - We partnered with the Super PAC App, an iPhone app that would identify political advertisements on TV and give you information about that group, such as how much money it was spending this election and articles about them. We made a web interface to their data to provide readers with more context for outside spending groups.

Interactive Map: Who’s Protesting Where? - When the Middle East erupted in protests in response to an anti-Muslim video uploaded to YouTube, Eliza Shapiro and I put together a visual guide with information on each protest as well as contextual information on each country. It was an interesting map to build since we had both point and polygon layers to deal with for hover states. As with all of our interactive maps, we used CartoDB.

Obama and Romney’s Bundlers - If bundlers had baseball cards, this is what they’d look like. We took a look at the biggest bundlers for each candidate. Collect ‘em all.

The Rise of the Political Non-Profit - How so-called “Dark Money” was influencing the 2012 election was one of the themes in a three-part series John Avlon and I wrote called the Super PAC Economy. This animated timeline overlays non-profit political expenditures and significant court decisions (Citizens United and lesser-known decisions) that determined what role these groups could play in politics.

The Dark Money Shuffle - Also in that series, we worked with Robert Maguire of the Center for Responsive Politics who had been compiling a database of grants that non-profits gave to each other. For the first time, we diagrammed this opaque world of money transfers that is only visible by manually going through hundreds of IRS forms. Full article

Election Right Rail - Showing the latest polls from battleground states, how those states voted historically, median income, and the latest unemployment figures, our politics sidebar was full of context. It no longer lives anywhere on our site, but you can see a standalone version of how it looked on the eve of the election through the linked title.

Note: We did all of these projects before starting this tumblr. You’ll find write-ups for the projects that follow but if you want to know how we built any of the stuff above, send me a message at @mhkeller.

Debate Dashboard and Bingo - Brian, Sam, and Vitaly Korenkov (one of our awesome developers) conceived of a great debate-night dashboard. We had a livestream, a live chat with our commentators, and a poll from Urtak, which is a polling platform that lets you pose simple yes/no/maybe questions to readers. It also lets readers submit questions they want other people to answer, so it’s a good back-and-forth between questions we’re interested in and what our audience is interested in. We’re often into giving our readers a voice on the site, so we liked it a lot. I came in during the last few hours before we were going to go live (a.k.a. after all the hard work was done) and added a bingo card. The coolest part about it is the bingo validation: the card checks how many you have in a row vertically, horizontally, and diagonally and tells you how many you need to win. NewsBeast Labs post.

Ground game: Obama Campaign Opens Up Big Lead in Field Offices - The airwave battle was being covered left and right, but we wanted to know what was happening on the ground. We scraped the two campaigns’ websites to map out their local HQs nationwide and found a big discrepancy between the two camps. In Ohio, for instance, Obama had a presence in so many counties where Romney didn’t that 10 percent of the state’s population lived in a county where the only volunteer center was an Obama HQ.

Technical note: We used CartoDB again for this map and it was a huge help. In the accompanying article, we ran interactive maps of Florida, Ohio, and Virginia. These separate maps required no real extra programming or map-making since CartoDB builds your map by querying a database. By setting our map query to SELECT * FROM national_map WHERE state = 'FL', we had a local map in minutes that we could swap out for another state if needed, which indeed ended up happening. NewsBeast Labs post.

Interactive Hate: The Great Obama-Loathing Canon - Matt DeLuca and I teamed up again to tackle the perennial problem of how to present a lot of information to the reader in digestible bites that make sense. This time, we presented over a hundred anti-Obama books in a mosaic that you can filter down by subject matter. NewsBeast Labs Post.

HavingTroubleVoting.com - We did an experiment on Election Day asking our readers, or anyone really, if they were having trouble voting, and if so, what kind of trouble. We plotted the responses on a map and color-coded the markers based on the type of problem. We partnered with Mother Jones on it to help us go through the responses to find patterns and to contact people to tell their stories. Our own reporters used the database in stories about massive lines and machine malfunctions. We’re totally honored and floored that CJR named it No. 2 in their Must-Read Interactives of 2012! More about it in our NewsBeast Labs post.

Election Night Interactive Map and Dashboard - A lot of teamwork went into our election night coverage from the development team, social, design… the list goes on. We took over our home page on election night with video commentary, a live updating tally, a live chat, article updates and more things that you could probably put a “live” prefix in front of. The map lives on in the linked title, a screenshot lives in our NewsBeast Labs post about it.

‘It Was Like a War Zone’: Hurricane-Ravaged Staten Island Reels - In the wake of the trauma caused by Hurricane Sandy, we did a map of Staten Island victims. It shows how many of the fatal tragedies were concentrated on the east side of the island.

Not-So-Super PACs: 2012’s Winners and Losers - DeLuca and I teamed up again to produce this tally of who made good investments this election cycle. There’s a long post about it, including some failed versions, in our NewsBeast Labs post.

Interactive Holiday Gift Guide - Lizzie Crocker, Isabel Wilkinson and I help you find out what sub-culture your friends might belong to in this gift guide flow chart.

Own a Gun? Tell Us Why? - December brought another terrible shooting and caused much thought about the state of gun laws. We wanted to hear from rational people on both sides of the debate by letting readers complete the sentences “I own a gun because…” or “I don’t own a gun because…”. In three days, we had over 1,300 responses that were, for the most part, very civil remarks from each group. We analyzed the responses and did a state-by-state breakdown of the common themes. We used some interesting algorithmic clustering to find these patterns, so expect a write-up soon. For now, read the post on how the project was born and how we collected the responses.

Notes and images from an ever-growing digital newsroom.

Newsweek & The Daily Beast

Contributors:
Brian Ries & Sam Schlinkert

Formerly:
Michael Keller, Andrew Sprouse, Lynn Maharas, & Clarisa Diaz
