URL verification
While working on a project, I needed to verify that a user-submitted URL was actually pointing to a valid page.
I found and slightly modified a code snippet that did just that using cURL. It's pretty simple and seemed quite effective for a long time. It requests only the headers for the page and verifies that a response actually came back.
function valid_url($p_url)
{
    // Assume plain HTTP when the user left out the scheme.
    if (!substr_count($p_url, '://'))
        $p_url = 'http://' . $p_url;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $p_url);
    curl_setopt($ch, CURLOPT_HEADER, 1); // get the header
    curl_setopt($ch, CURLOPT_NOBODY, 1); // and *only* get the header
    // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // force a new connection instead of reusing a cached one
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
    if (!curl_exec($ch))
        return false;
    else
        return $p_url;
}
Then today, a particular URL came along that caused this function to choke, slowing the page to a crawl and finally timing out after a long wait. While debugging and waiting for responses, I discovered that the page was coming back with an empty response.
Further examination yielded an HTTP response code of "0", which is what cURL reports when it never received an HTTP response at all. Head scratching, but now I was on the right track. Looking at the response codes from a regular browser request, everything seemed OK.
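For this kind of debugging, it can help to look at cURL's own error number and message alongside the HTTP status code. The following is a minimal sketch, not the code used during the original debugging: the names probe_url and describe_probe are hypothetical, and the 5 and 10 second timeouts are arbitrary assumptions chosen so that a dead URL fails quickly.

```php
<?php
// Hypothetical helper: run a header-only request and collect what cURL reports.
function probe_url(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // header-only request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return output instead of echoing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // assumed limit: 5 s to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: 10 s overall
    curl_exec($ch);

    return [
        'errno'     => curl_errno($ch),                       // 0 means no cURL-level error
        'error'     => curl_error($ch),                       // human-readable error text
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
    ];
}

// Turn the collected values into a one-line summary for logging.
function describe_probe(array $info): string
{
    if ($info['errno'] !== 0) {
        return "cURL error {$info['errno']}: {$info['error']}";
    }
    if ($info['http_code'] === 0) {
        return 'No HTTP response received';
    }
    return "HTTP {$info['http_code']}";
}
```

Calling describe_probe(probe_url($some_url)) on a problem URL shows whether the failure is a timeout, an empty reply, or an actual HTTP error status.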
That difference turned out to be the answer: the browser sends a user agent with its request, and my cURL request did not. Simply supplying a user agent with the cURL request was enough to trick this page into thinking it was responding to a browser. I guess they were trying to be sneaky.
Edit: I had another related issue that tipped me off that impersonating a browser is bad form. Although I'm only using this to grab the occasional header to verify that a page exists, I could still be classified as a malicious bot and blacklisted. Definitely not cool. The solution is to make your own user agent that clearly gives your site or application's name and web address, e.g.
User-Agent: MyApplication (+http://example.com/application-home/)
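One way to keep that identifying string consistent is to build it in a single place. This is only a sketch under assumptions: build_user_agent and set_identifying_user_agent are hypothetical helper names, and the application name and address are the placeholder values from the example above. Note that CURLOPT_USERAGENT takes only the header's value, because cURL adds the "User-Agent:" name itself.

```php
<?php
// Hypothetical helper: build a descriptive user agent value such as
// "MyApplication (+http://example.com/application-home/)".
function build_user_agent(string $appName, string $homeUrl): string
{
    // Strip CR/LF so the value cannot inject extra header lines.
    $appName = str_replace(["\r", "\n"], '', trim($appName));
    $homeUrl = str_replace(["\r", "\n"], '', trim($homeUrl));
    return sprintf('%s (+%s)', $appName, $homeUrl);
}

// Hypothetical helper: apply the value to an existing cURL handle.
// Only the value is passed; cURL supplies the "User-Agent:" header name.
function set_identifying_user_agent($ch, string $appName, string $homeUrl): void
{
    curl_setopt($ch, CURLOPT_USERAGENT, build_user_agent($appName, $homeUrl));
}
```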
After adding that line and a new check at the bottom for error status codes, such as 404 or 503, I now have the following function, which is working like a champ.
function valid_url($p_url)
{
    // Assume plain HTTP when the user left out the scheme.
    if (!substr_count($p_url, '://'))
        $p_url = 'http://' . $p_url;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $p_url);
    curl_setopt($ch, CURLOPT_HEADER, 1); // get the header
    curl_setopt($ch, CURLOPT_NOBODY, 1); // and *only* get the header
    // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // force a new connection instead of reusing a cached one
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
    // Some sites will not respond without a user agent.
    // Make sure to clearly identify your application and website, especially
    // if making large numbers of requests.
    // Pass only the value: cURL adds the "User-Agent:" header name itself.
    curl_setopt($ch, CURLOPT_USERAGENT,
        "MyApplication (+http://example.com/application-home/)");
    // Valid only if a response came back and its status is not an error (4xx/5xx).
    return (curl_exec($ch)
        ? (curl_getinfo($ch, CURLINFO_HTTP_CODE) < 400 ? $p_url : false)
        : false);
}
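The two non-network decisions in the function, defaulting the scheme and judging the status code, can be pulled out and checked on their own. The sketch below is illustrative only: with_default_scheme and status_is_ok are hypothetical names, and status_is_ok is deliberately stricter than the "< 400" check above, since it also rejects 0 and 1xx codes. That stricter range is an assumption, not part of the original function.

```php
<?php
// Hypothetical helper: prepend "http://" when the URL has no scheme separator.
function with_default_scheme(string $url): string
{
    return strpos($url, '://') === false ? 'http://' . $url : $url;
}

// Hypothetical helper: treat 2xx and 3xx as "the page exists".
// Assumption: 0 (no response) and 1xx are rejected as well as 4xx/5xx.
function status_is_ok(int $httpCode): bool
{
    return $httpCode >= 200 && $httpCode < 400;
}
```

With these in place, a redirect (301) still counts as a valid page, while 404 and 503 are rejected, which matches the intent of the status check described above.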




