Filtering news
A player's name appears in an article. The crawler sees it, matches it, stores it. Now that article shows up on the player's news feed. Simple. Except the article isn't about them.
"Lakers finalise trade package centred around Anthony Davis" mentions LeBron James in paragraph four. A passing reference. Context, not subject. But the crawler doesn't know the difference between being the story and being mentioned in one.
The mention problem
Name-matching is binary. The name is either in the text or it isn't. What we actually need is relevance — is this article meaningfully about this player, or does it just reference them in passing?
A headline mention is a strong signal. A first-paragraph mention is decent. A mention buried in paragraph six alongside fifteen other names is noise. But automating that distinction requires understanding article structure, not just scanning for strings.
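To make the headline / first paragraph / buried distinction concrete, here is a minimal sketch of a positional score. It assumes the article has already been split into a headline and a list of paragraphs. The function name `positional_score` and the weights (1.0, 0.6, halving per paragraph) are illustrative guesses, not values from the current system.

```python
def positional_score(headline: str, paragraphs: list[str], name: str) -> float:
    """Score a name by where it first appears (hypothetical weights)."""
    needle = name.casefold()
    # A headline mention is treated as the strongest signal.
    if needle in headline.casefold():
        return 1.0
    for index, paragraph in enumerate(paragraphs):
        if needle in paragraph.casefold():
            # First paragraph scores 0.6; each later paragraph halves that.
            return 0.6 * (0.5 ** index)
    # The name never appears at all.
    return 0.0


headline = "Lakers finalise trade package centred around Anthony Davis"
paragraphs = [
    "The Lakers are close to a deal.",
    "Talks accelerated this week.",
    "Several teams are involved.",
    "LeBron James is expected to stay put.",
]
davis = positional_score(headline, paragraphs, "Anthony Davis")
lebron = positional_score(headline, paragraphs, "LeBron James")
```

With these weights the headline subject scores 1.0 and the paragraph-four mention scores well under 0.1, so a threshold somewhere in between would keep the first and drop the second. Plain substring matching is still doing the detection here; the score only changes what happens after a match.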
What we do now
The current system uses headline fingerprinting and Jaccard similarity to deduplicate stories across sources. If ESPN and Yahoo both cover the same trade, only the version from the highest-priority source survives. That part works.
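For readers unfamiliar with the technique, here is a rough sketch of how headline fingerprinting plus Jaccard similarity can deduplicate stories. This is not the production code. The stopword list, the `SOURCE_PRIORITY` table, the 0.6 threshold, and the article dict shape are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list and source ranking (lower number wins).
STOPWORDS = frozenset({"a", "an", "the", "to", "of", "in", "for", "on", "and", "with"})
SOURCE_PRIORITY = {"espn": 0, "yahoo": 1}


def fingerprint(headline: str) -> frozenset[str]:
    """Reduce a headline to its set of lowercase, non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", headline.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(articles: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep one article per story, preferring higher-priority sources."""
    # Visit high-priority sources first so their versions are kept.
    ranked = sorted(articles, key=lambda a: SOURCE_PRIORITY.get(a["source"], 99))
    kept: list[tuple[dict, frozenset[str]]] = []
    for article in ranked:
        fp = fingerprint(article["headline"])
        # Keep the article only if it is not too similar to anything kept so far.
        if all(jaccard(fp, kept_fp) < threshold for _, kept_fp in kept):
            kept.append((article, fp))
    return [article for article, _ in kept]


articles = [
    {"source": "yahoo", "headline": "Lakers finalise trade package centred around Anthony Davis"},
    {"source": "espn", "headline": "Lakers finalise trade package around Anthony Davis"},
    {"source": "espn", "headline": "Celtics sign veteran guard"},
]
unique = deduplicate(articles)
```

The two Lakers headlines share seven of eight distinct tokens, so they collapse into one story and the ESPN version is kept. Note that this only answers "are these the same story?" and says nothing about who the story is about.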
What doesn't work well is the relevance filter. A player mentioned once in a 2,000-word roundup gets the same treatment as a player who is the sole subject of a profile piece. Both get stored. Both show up. The feed fills with tangential mentions that dilute the signal.
Where this goes
The honest answer is that distinguishing subject from mention is still unsolved here. The options are positional weighting (headline > lead > body), mention density (one name in 200 words vs. one name in 2,000), or co-occurrence patterns (is this player mentioned alongside their team's activity, or just namedropped?).
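Two of those options, mention density and co-occurrence, are simple enough to sketch. The code below is a rough illustration under stated assumptions, not a design decision. The function names, the "per 100 words" unit, and the idea of passing in a list of team terms are all hypothetical.

```python
import re


def mention_density(text: str, name: str) -> float:
    """Mentions of `name` per 100 words of `text`."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    mentions = len(re.findall(re.escape(name), text, flags=re.IGNORECASE))
    return 100.0 * mentions / len(words)


def co_occurrence_ratio(paragraphs: list[str], name: str, team_terms: list[str]) -> float:
    """Share of paragraphs mentioning `name` that also mention a team term."""
    needle = name.casefold()
    with_name = [p.casefold() for p in paragraphs if needle in p.casefold()]
    if not with_name:
        return 0.0
    hits = sum(any(term.casefold() in p for term in team_terms) for p in with_name)
    return hits / len(with_name)


# One mention in 200 words vs. one mention in 2,000 words.
short_piece = "LeBron James " + "word " * 198
long_roundup = "LeBron James " + "word " * 1998
short_density = mention_density(short_piece, "LeBron James")
long_density = mention_density(long_roundup, "LeBron James")

paragraphs = [
    "LeBron James scored 30 as the Lakers won.",
    "Elsewhere, LeBron James was spotted courtside.",
    "The Celtics lost.",
]
ratio = co_occurrence_ratio(paragraphs, "LeBron James", ["Lakers"])
```

The same single mention is ten times denser in the short piece than in the long roundup, which is the gap the density option is meant to capture. Neither function handles nicknames, surnames used alone, or pronouns, so both undercount real subjects.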
None of these are clean. All of them are better than what we have, which is: if the name appears, the article counts. Working on it.