News from the Edge

Entity extraction, or What high blood pressure and Irish soda bread have in common with a multinational oil company

Posted by Jon Campbell on Mar 24, 2016 9:34:45 AM

In my early days of doing business research, I thought computer searches could solve all of the world’s research problems. One of my first assignments was for a large investment bank that only bought blue chip stocks. I would pull up a company’s stock ticker symbol and see all of its important financial data and news, then create a comprehensive research report. As I started to cross-reference this “tickered” news with additional keyword searches, I immediately saw the problem. Writers would often not include the ticker in a story – or they would include the wrong ticker. 

I tried using keywords with tickers, but that proved problematic as well. I still remember one of my more frustrating assignments. I had to pull up news about “British Petroleum,” which was transitioning its name to “BP.” I started by doing a ticker search, and then did my keyword cross-reference. As expected, I found some additional stories. I then added the new name “BP” to the search and got deluged with additional stories, some of them appropriate but many of them completely irrelevant.

There were health stories, an article about a playground opening in Brooklyn, a recipe for Irish Soda Bread and a request for proposal for a sewer project from the French government. What did they have in common? They all contained the term “BP,” which was also used as an abbreviation for Blood Pressure, Borough President, Baking Powder and Boîte Postale (French for Post Office Box).

Clearly there was a problem, so I started going through this unlikely group of stories. I created a lengthy search string that included and excluded a large number of terms and phrases. I was always looking for an easier way to do my job: I wanted to spend more time analyzing information for my research reports and less time building elaborate searches and hunting for the term that had triggered an incorrect result.
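
To make that include/exclude approach concrete, here is a minimal sketch in Python. The term lists and the `matches` function are my own illustration, not the search syntax of any particular product, and the exclusion list is deliberately incomplete.

```python
import re

# Hypothetical term lists, chosen only to illustrate the approach.
INCLUDE = ["BP", "British Petroleum"]
EXCLUDE = ["blood pressure", "borough president", "baking powder"]


def _has_term(text, term, ignore_case):
    """Whole-word search for a term or phrase in the text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags) is not None


def matches(story):
    """Return True if the story has an include term and no exclude term."""
    # Include terms are case-sensitive so that "bp" inside ordinary words
    # or lowercase text is not treated as the company name.
    included = any(_has_term(story, t, ignore_case=False) for t in INCLUDE)
    excluded = any(_has_term(story, t, ignore_case=True) for t in EXCLUDE)
    return included and not excluded
```

With these lists, `matches("The recipe calls for 1 tsp BP (baking powder) and buttermilk.")` returns `False`, as intended. But `matches("High BP is a risk factor for stroke.")` returns `True`, because the story never spells out "blood pressure." Each miss like that means another term added to the string by hand.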

This is where a good entity extraction process can benefit searchers. Properly identifying company news as being related to BP or British Petroleum can save searchers a lot of time and aggravation, but this process is not as easy as it would seem. There are hundreds of thousands of companies, public and private, that have connections that we may not even consider, such as subsidiaries, parent companies and conglomerates. The same holds true for any other entity (people, brands or objects).

An effective extraction process (in this case, properly identifying a company) relies on quality authority files for cross-referencing. NewsEdge uses these in its process and is able to consolidate all of the search terms for a company (or topic) into a single code. It takes care of the search logic and handles the terms that throw off a search.
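
As a rough illustration of the idea (not NewsEdge's actual process, which is not described here), an authority file can be pictured as a lookup table that maps every known name for an entity to one code. In the Python sketch below, the `ORG:BP` code, the record layout and the context cue words are all assumptions made up for the example.

```python
import re

# Hypothetical authority file: one record per entity, keyed by a made-up code.
# "aliases" are unambiguous names; "ambiguous" names only count when a
# context cue word also appears in the story.
AUTHORITY = {
    "ORG:BP": {
        "aliases": ["British Petroleum", "BP plc"],
        "ambiguous": ["BP"],
        "cues": ["oil", "petroleum", "refinery", "barrels", "drilling"],
    },
}


def _found(text, term, ignore_case=False):
    """Whole-word search for a term or phrase in the text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags) is not None


def extract_entities(text):
    """Return the sorted entity codes whose names appear in the text."""
    codes = set()
    for code, rec in AUTHORITY.items():
        if any(_found(text, a, ignore_case=True) for a in rec["aliases"]):
            codes.add(code)
        elif any(_found(text, a) for a in rec["ambiguous"]) and any(
            _found(text, c, ignore_case=True) for c in rec["cues"]
        ):
            codes.add(code)
    return sorted(codes)
```

Stories are tagged once, when they come in, and the searcher then asks for the code rather than a keyword string. Here a story such as "BP said oil output rose last quarter." is tagged `ORG:BP`, while "High BP is a risk factor for stroke." gets no tag at all. A cue list this crude would still be fooled by a recipe that mentions both "BP" and olive oil, which is why the quality of the authority file matters so much.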

I know that when I use the product I will get the news I need without having to wade through a bunch of stories that have no relation to my research goals. It saves me a lot of time, and now that I am a consultant it helps me spend less time researching and more time pursuing my real passion – baking.

And hey, if you are interested, I have a great recipe for Irish Soda Bread.


photo credit: Golden Hour BP via photopin (license)

Topics: Content, taxonomy, categorization

Why read News from the Edge?

The NewsEdge Blog

NewsEdge, a service of Acquire Media, has been serving the information needs of busy professionals in corporate, finance and government for over 25 years. We are experts in surfacing business-relevant information through web, mobile and feed deliveries. We specialize in content categorization and distribution to ensure users receive only the news they need, when they need it and how they need it.

Our blog aims to:

  • Discuss hot topics affecting the information industry
  • Offer our insights into new technologies - the good and the bad
  • Invite others to share viewpoints on how information is changing in real-world environments
