Marc Andreessen is one of my role models and favorite thought leaders in entrepreneurship. Back in simpler times, he used to write a blog of sagely startup advice. But since joining Twitter, Marc now shares his thoughts in small bursts of related tweets, forcing me to read his thoughts in a disjoint, reverse chronological order.

I decided to use Python fix this terribly pressing problem. The idea is to create a script that detects tweets posted in quick succession and batches them together. For amusement, it will also run some rudimentary NLP to generate a "title" for each batch of tweets. In the end, they will be output as a blog-like Markdown file (Marc-down file?)

The Twitter API

First we use twitter module to OAuth authenticate with Twitter. Despite the un-Pythonic method names, this module is quite pleasant to use.

import twitter
import json

api = twitter.Api(access_token_key="REDACTED",
                  access_token_secret="REDACTED",
                  consumer_key='REDACTED', consumer_secret='REDACTED')

The Twitter API doesn't actually provide a way to retrieve every tweet a user has created. Instead, we have to manually iterate over batches of tweets in reverse chronological order. Even then, the API limits requests to only the 3200 most recent statuses. While not ideal, it should be good enough for a quick demo.

def get_all_user_tweets(user):
    tweets = []
    offset = 0
    max_id = None
    while True:
        this_batch = api.GetUserTimeline(screen_name=user, max_id=max_id, count=200, exclude_replies=True)
        if len(this_batch) == 1:
            break
        tweets += this_batch[offset:]
        offset = 1
        max_id = min([tweet.id for tweet in this_batch])
    return [tweet.AsDict() for tweet in tweets]

tweets = get_all_user_tweets("pmarca")

Batching the tweets

When Marc goes on a Twitter rant, he typically posts several tweets about the same topic, within minutes of each other. After retreiving all his statuses, the next job is therefore to detect batches of tweets posted in quick succession.

import dateutil.parser

def diff_created_at(a, b):
    return abs(dateutil.parser.parse(a["created_at"]) - dateutil.parser.parse(b["created_at"])).total_seconds()

def batch_tweets(tweets, rapidness):
    batches = []
    curr_batch = []
    for tweet in tweets:
        if len(curr_batch) == 0 or diff_created_at(curr_batch[-1], tweet) < rapidness:
            curr_batch.append(tweet)
        else:
            batches.append(curr_batch)
            curr_batch = []
    batches.append(curr_batch)
    return batches

The batch_tweets function above takes a rapidness parameter, which is the maximum time between tweets to them be considered part of the same batch. There is no science to setting this parameter. Through random observation, I found that Marc usually posts related tweets within 4 minutes of each other. So we will use 240 seconds as the value of this parameter.

Finally, we want to toss out batches of only 1 tweet long because we are only interested in Marc's "longform" Twitter content.

batches = batch_tweets(tweets, 240)
rants = [batch for batch in batches if len(batch) > 1]

Generating titles using NLTK

What good are these pseudo blog posts without pseudo titles? Let's generate some titles for each batch of tweets using using NLTK. I experimented with various methods of title generation, from bigram collocation to simply outputting the most frequent words. They were all pretty crappy.

The best results came from finding noun phrases in the batches of tweets. While not perfect, many topics that Marc tweets about are noun phrases, such as "Entrepreneurial mindset" and "insatiable demand." Below are some methods to generate a parse tree out of a sequence of tokens and retrieve the noun phrases.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
stopwords = stopwords.words('english')

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    return [subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.node=='NP')]

def extract_terms(tokens):
    postoks = nltk.tag.pos_tag(tokens)
    tree = chunker.parse(postoks)
    terms = leaves(tree)
    return terms

Outputting the "Marc"down

Finally we put this all together by iterating over the "rants," generating pseudo-titles and writing them to a file. If an interesting title couldn't be found, it simply outputs "Idea #XYZ".

import itertools

with open("rants.md", "w") as f:
    rant_count = 1
    f.write("# The Tweets of Marc Andreessen \n")
    for rant in rants:
        tokens =  list(itertools.chain(*[word_tokenize(t["text"]) for t in rant]))
        filtered_tokens = [t for t in tokens if t.lower() not in stopwords]
        nps = [p for p in extract_terms(filtered_tokens) if len(p) > 1]

        if len(nps) != 0:
            title = ' '.join([tok for tok, _ in nps[0]]).encode("utf8")
            f.write("## On %s\n" % title)
        else:
            f.write("## Idea %d\n" % rant_count)

        for t in reversed(rant):
            f.write(t["text"].encode("utf8"))
            f.write("\n\n")
        f.write("\n")
        rant_count += 1

So how did it turn out? Take a look for yourself here. The batching turned out to be fairly accurate in grouping together Marc's tweets into a series of related thoughts. The autogenerated titles are hit or miss. Half of them are gibberish and half are surprisingly accurate. Given the simplicity of the natural language functions I wrote, I'm not surprised.