Link - Neural networks for extracting useful text from HTML

published Feb 10, 2011 01:04   by admin ( last modified Feb 10, 2011 01:04 )

 

You’ve finally got your hands on the diverse collection of HTML documents you needed. But the content you’re interested in is buried among adverts, layout tables, formatting markup, and various other links. Even worse, the menus, headers and footers contain visible text that you want to filter out. If you don’t want to write a complex scraping program for each type of HTML file, there is a solution.



Read more: The Easy Way to Extract Useful Text from Arbitrary HTML - AI Depot