Extract Main Article from HTML files in Directory Structure
by ericlindellnyc from LinuxQuestions.org on (#5Q2C3)
I've successfully used this to extract the main article from one local HTML file:
Code:
cat JamesWatt.html | trafilatura >> JamesWatt.txt

To do this recursively for a whole nested directory, I tried

Code:
find . -name '*.html' -exec cat '{}' | trafilatura >> '{}.txt' \;

producing these errors

Code:
find: -exec: no terminating ";" or "+"
trafilatura: error: unrecognized arguments: ;

html2text removes tags, but leaves in menus and other non-article verbiage.
I've posted to two forums and for some reason haven't received any reply.
Assistance would be greatly appreciated !!
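
A likely explanation for both errors: the shell splits the command at the pipe before find ever runs, so find only sees "-exec cat {}" with no terminator, while trafilatura receives the stray ";" as an argument. One possible workaround, offered as a minimal sketch rather than a definitive fix, is to have find start a small sh script that does the redirection per file. The function name extract_articles is hypothetical, and the sketch assumes trafilatura reads HTML on standard input, as the single-file command above does. It writes page.txt next to page.html and overwrites (>) rather than appends (>>); both choices are assumptions.

```shell
# Hypothetical helper: extract_articles [DIR]
# Walks DIR (default: current directory) and writes NAME.txt next to each NAME.html.
extract_articles() {
  find "${1:-.}" -type f -name '*.html' -exec sh -c '
    # find passes batches of file names as positional parameters
    for f do
      # Assumption: trafilatura reads HTML from stdin, as in the one-file example
      trafilatura < "$f" > "${f%.html}.txt"
    done
  ' sh {} +
}
```

Calling extract_articles . from the top of the directory tree would process every nested .html file. Quoting "$f" inside the inner script keeps file names with spaces intact.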