[Right and Rules] web scraping
by Michael Uplawski from LinuxQuestions.org on (#5JZKJ)
Good morning
In a previous thread on a different topic, shruggy had pointed out that one interpretation of your Digital Millennium Copyright Act (DMCA) (see there) could prevent us from using Web content in a way that we prefer over what the site-owner had imagined - in the USA, that is.
Web-scraping, in contrast, is defined however it suits the person or organization that publishes the definition - something between a way to do serious work and blatant net-abuse (my words).
Now it seems that ... I do something like that. What is your opinion?
It goes like this: I am quite worked up about the ill-conceived Web-sites of my favorite radio-stations - not the only topic I post about on LQ, but I am approaching saturation.
As a means to avoid their "friends" at diverse profiling- and tracking-companies, I have automated the retrieval of their RSS data (XML). By doing so, I have the direct URLs to their recent and even quite old broadcasts and can download them serenely at any time without consulting the infested Web-site.
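To make the RSS step concrete, here is a minimal sketch in Python of how direct download URLs can be read from a feed. It is not my actual script; the sample feed, the example.org URLs and the function name `enclosure_urls` are made up for illustration, and it assumes the feed carries the audio file in a standard RSS 2.0 `<enclosure>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real feed would be downloaded from its RSS URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example broadcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/audio/episode1.mp3"
                 length="123" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2 (no audio attached)</title>
    </item>
  </channel>
</rss>
"""


def enclosure_urls(rss_text):
    """Return (title, url) pairs for every item that carries an enclosure."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # items without an audio file are skipped
        url = enclosure.get("url")
        if url:
            pairs.append((item.findtext("title", default=""), url))
    return pairs


if __name__ == "__main__":
    for title, url in enclosure_urls(SAMPLE_RSS):
        print(title, url)
```

Once the feed URL is known, nothing in this step touches the Web-site itself; only the XML is read.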
When I *asked* the site-operators whether they could simply publish, along with their existing list of broadcasts, a flat URL for each RSS feed, none of the subsidiaries of Radio France ever cared to respond.
I wrote a Web-bot which assembles these data now; it had to run only once to establish the complete list.
So far, I only use data which is already present on the same page - even if only after a click on a button which dynamically loads the URL of an RSS feed... what rubbish! (But I can work with it).
But now I have noticed that there are many broadcasts that I did not know existed, but which look interesting. The short description is missing from the list, and so my bot is obliged to open a second browser-tab for each single broadcast, get the description and close the tab before proceeding to the next item. <=== THIS IS IT.
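Reduced to its logic, this step looks roughly like the following Python sketch. Again, this is not my bot: the dictionary keys `title` and `page_url`, the function `add_descriptions` and the pause between requests are assumptions for illustration, and the part that actually opens the page is passed in as a function so that the merge itself can be tried without any network access.

```python
import time


def add_descriptions(broadcasts, fetch_description, pause_seconds=1.0):
    """Return copies of the broadcast entries with a 'description' added.

    broadcasts        -- list of dicts with (hypothetical) keys 'title' and 'page_url'
    fetch_description -- callable that takes a page URL and returns the description
    pause_seconds     -- wait between two page requests (an assumed courtesy delay)
    """
    merged = []
    for index, broadcast in enumerate(broadcasts):
        if index > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)  # one page at a time, not in a burst
        try:
            description = fetch_description(broadcast["page_url"])
        except Exception:
            description = ""  # keep the entry even if its page cannot be read
        merged.append({**broadcast, "description": description})
    return merged


if __name__ == "__main__":
    # Stand-in for the second browser-tab: a lookup table instead of a real page.
    fake_pages = {"https://example.org/broadcast/a": "A weekly science magazine."}
    entries = [
        {"title": "A", "page_url": "https://example.org/broadcast/a"},
        {"title": "B", "page_url": "https://example.org/broadcast/b"},
    ]
    for entry in add_descriptions(entries, fake_pages.__getitem__, pause_seconds=0):
        print(entry["title"], "-", entry["description"] or "(no description)")
```

The result is one list built from two pages per broadcast, which is exactly the combination my question is about.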
Would you say that this last step, where I combine information from two different pages to produce one list for private use, is Web-scraping, and what kind of Web-scraping... some evil terrorism thing or just "mastering the tools of the Web"?
The legislation in Germany is similar in that it is not clear what is "allowed", "forbidden" or just "impertinent". I somehow do not feel inclined to ask about the French law. The authorities here have declared themselves incompetent in the domains that had been attributed to them. It would be awfully difficult to avoid pitfalls and misunderstandings.
PS: I forgot to mention the page where I describe the RSS munging; the scripts are commented in English. And here is one of the lists that my Web-bot created. There is no hyperlink in this file; each number identifies an RSS feed.