How to extract complete sub links using Simple-HTML-DOM?

                Problem description

                The following is the basic code I use to extract sublinks from a page:

                <?php
                    include_once('simple_html_dom.php');
                    function extract_links($target_url)
                    {   
                        $html = new simple_html_dom();
                        $html->load_file($target_url);  
                        $i=0;
                        $crawl =array();
                        foreach($html->find('a') as $link)
                        {
                            $crawl[$i] = $link->href;
                            $i++;
                        }
                        var_dump($crawl);
                    }
                    extract_links('http://stackoverflow.com');
                ?>
                

                The output is as follows:

                array
                  0 => string 'http://stackexchange.com' (length=24)
                  1 => string '/users/login' (length=12)
                  2 => string 'http://careers.stackoverflow.com' (length=32)
                  3 => string 'http://chat.stackoverflow.com' (length=29)
                  4 => string 'http://meta.stackoverflow.com' (length=29)
                  5 => string '/about' (length=6)
                  6 => string '/faq' (length=4)
                  7 => string '/' (length=1)
                  8 => string '/questions' (length=10)
                  9 => string '/tags' (length=5)
                  10 => string '/users' (length=6)
                  11 => string '/badges' (length=7)
                  12 => string '/unanswered' (length=11)
                  13 => string '/questions/ask' (length=14)
                  14 => string '?tab=interesting' (length=16)
                  15 => string '?tab=featured' (length=13)
                  16 => string '?tab=hot' (length=8)
                  17 => string '?tab=week' (length=9)
                  18 => string '?tab=month' (length=10)
                  19 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
                  20 => string '/questions/tagged/eclipse' (length=25)
                  21 => string '/questions/tagged/postgresql' (length=28)
                  22 => string '/questions/tagged/jpa' (length=21)
                  23 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
                  24 => string '/users/865448/tostao' (length=20)
                  25 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
                  26 => string '/questions/tagged/javascript' (length=28)
                  27 => string '/questions/tagged/jquery' (length=24)
                  28 => string '/questions/tagged/html' (length=22)
                  29 => string '/questions/tagged/html5' (length=23)
                  30 => string '/questions/tagged/google-chrome' (length=31)
                  31 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
                  32 => string '/users/962868/tejas' (length=19)
                  33 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts' (length=73)
                  34 => string '/questions/tagged/linux' (length=23)
                  35 => string '/questions/tagged/windows' (length=25)
                  36 => string '/questions/tagged/bash' (length=22)
                  37 => string '/questions/tagged/line-endings' (length=30)
                  38 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts/?lastactivity' (length=87)
                  39 => string '/users/1899640/that-other-guy' (length=29)
                  40 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
                  41 => string '/questions/tagged/sockets' (length=25)
                  42 => string '/questions/tagged/p2p' (length=21)
                  43 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
                  44 => string '/users/911651/xsnrg' (length=19)
                  45 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
                  46 => string '/questions/tagged/iphone' (length=24)
                  47 => string '/questions/tagged/ios' (length=21)
                  48 => string '/questions/tagged/objective-c' (length=29)
                  49 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
                  50 => string '/users/107715/matt-n' (length=20)
                  51 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
                  52 => string '/questions/tagged/wpf' (length=21)
                  53 => string '/questions/tagged/timer' (length=23)
                  54 => string '/questions/tagged/dispatcher' (length=28)
                  55 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
                  56 => string '/users/1741800/nashat' (length=21)
                  57 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c' (length=75)
                  58 => string '/questions/tagged/visual-c%2b%2b' (length=32)
                  59 => string '/questions/tagged/exception-handling' (length=36)
                  60 => string '/questions/tagged/access-violation' (length=34)
                  61 => string '/questions/tagged/structured-exception' (length=38)
                  62 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c/?lastactivity' (length=89)
                  63 => string '/users/901812/big-endian' (length=24)
                  64 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
                  65 => string '/questions/tagged/c%23' (length=22)
                  66 => string '/questions/tagged/asp.net-mvc' (length=29)
                  67 => string '/questions/tagged/asp.net-mvc-4' (length=31)
                  68 => string '/questions/tagged/authorization' (length=31)
                  69 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
                  70 => string '/users/644969/cadrell0' (length=22)
                  71 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
                  72 => string '/questions/tagged/c%23' (length=22)
                  73 => string '/questions/tagged/razor' (length=23)
                  74 => string '/questions/tagged/nopcommerce' (length=29)
                  75 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
                  76 => string '/users/1378841/mlg74' (length=20)
                  77 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
                  78 => string '/questions/tagged/resize' (length=24)
                  79 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
                  80 => string '/users/2026451/satish-patil' (length=27)
                  81 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
                  82 => string '/questions/tagged/php' (length=21)
                  83 => string '/questions/tagged/preg-match' (length=28)
                  84 => string '/questions/tagged/strpos' (length=24)
                  85 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
                  86 => string '/users/963414/samual99' (length=22)
                  87 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
                  88 => string '/questions/tagged/android' (length=25)
                  89 => string '/questions/tagged/google-maps' (length=29)
                  90 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
                  91 => string '/users/1520564/blubar' (length=21)
                  92 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
                  93 => string '/questions/tagged/css' (length=21)
                  94 => string '/questions/tagged/ipad' (length=22)
                  95 => string '/questions/tagged/ssl' (length=21)
                  96 => string '/questions/tagged/mobile-safari' (length=31)
                  97 => string '/questions/tagged/html5-appcache' (length=32)
                  98 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
                  99 => string '/users/2026375/twoface' (length=22)
                  100 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
                  101 => string '/questions/tagged/php' (length=21)
                  102 => string '/questions/tagged/laravel' (length=25)
                  103 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
                  104 => string '/users/291557/duality' (length=21)
                  105 => string '/questions/13928812/xmlserializer-generateserializer-and-collections' (length=68)
                  106 => string '/questions/tagged/c%23' (length=22)
                  107 => string '/questions/tagged/xml-serialization' (length=35)
                  108 => string '/questions/13928812/xmlserializer-generateserializer-and-collections/?lastactivity' (length=82)
                  109 => string '/users/1200614/phil' (length=19)
                  110 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
                  111 => string '/questions/tagged/android' (length=25)
                  112 => string '/questions/tagged/keyboard' (length=26)
                  113 => string '/questions/tagged/resize' (length=24)
                  114 => string '/questions/tagged/window' (length=24)
                  115 => string '/questions/tagged/views' (length=23)
                  116 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
                  117 => string '/users/1137413/725623452362' (length=27)
                  118 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
                  119 => string '/questions/tagged/silverlight' (length=29)
                  120 => string '/questions/tagged/flash' (length=23)
                  121 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
                  122 => string '/users/191882/legege' (length=20)
                  123 => string '/questions/14611143/how-to-syncrhonize-on-site-in-memory-no-sql-datasources-with-central-database-in' (length=100)
                  124 => string '/questions/tagged/architecture' (length=30)
                  125 => string '/questions/tagged/nosql' (length=23)
                  126 => string '/questions/tagged/java-ee-6' (length=27)
                  127 => string '/questions/tagged/in-memory-database' (length=36)
                  more elements...
                

                Now consider the '/about' sublink in the array. I want it to be displayed as 'https://stackoverflow.com/about'. Why is only part of the link returned in some cases, while the complete link is returned in others? Also, some links start with a '?' sign. How can I sanitize these links?

                Consider "http://en.wikipedia.org/wiki/Web_crawler". If I perform extract_links on it, I get a sublink like "http://en.wikipedia.org/wiki/Web_crawler/wiki/Web_search_engine", which is invalid, and most of the links come out in this format. The correct link is "http://en.wikipedia.org/wiki/Web_search_engine". I am using this function in another program that will pass in an array of links, so I cannot hard-code the if conditions. The following is the code fragment I am using now:

                foreach($html->find('a') as $link)
                {   
                    $href = $link->href;
                    $fchr = substr($href, 0, 1);
                    if ($fchr === '/')
                    {
                        $href = $target_url.$href;
                    }
                    else if ($fchr === '?')
                    {
                        $href = $target_url.'/'. $href;
                    }
                }
                

                Recommended answer

                @pguardiario's comment

                As suggested in his comment, phpUri is the perfect solution for converting relative URLs to absolute ones. You can find it here,
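
To illustrate what such a relative-to-absolute conversion involves, here is a minimal sketch in plain PHP. It is not phpUri's code or API: `resolve_url` and `remove_dot_segments` are hypothetical helper names introduced only for this example. The logic loosely follows the reference-resolution rules of RFC 3986 and assumes an http(s)-style base URL with a host; it ignores user-info in the base and is not a full implementation.

```php
<?php
// Hypothetical helpers for illustration; not part of Simple-HTML-DOM or phpUri.

// Collapse "." and ".." segments in an absolute path (one starting with "/").
function remove_dot_segments(string $path): string
{
    $segments = explode('/', $path);
    $last = end($segments);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '.') {
            continue;
        }
        if ($seg === '..') {
            // Never pop the leading empty segment that represents the root "/".
            if (count($out) > 1) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $seg;
    }
    // A trailing "." or ".." refers to a directory, so keep a trailing slash.
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return implode('/', $out);
}

// Resolve $rel (an href taken from the page) against $base (the page URL).
function resolve_url(string $base, string $rel): string
{
    // Anything with a scheme (http:, mailto:, ...) is already absolute.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return $rel; // assumption: without a usable base, leave the link alone
    }
    // Protocol-relative link: //host/path
    if (strncmp($rel, '//', 2) === 0) {
        return $b['scheme'] . ':' . $rel;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';
    $baseQuery = isset($b['query']) ? '?' . $b['query'] : '';

    if ($rel === '') {
        return $root . $basePath . $baseQuery;
    }
    if ($rel[0] === '#') {
        return $root . $basePath . $baseQuery . $rel;
    }
    if ($rel[0] === '?') {
        // Query-only link: keep the base path, replace the query.
        return $root . $basePath . $rel;
    }
    // Separate the path part of $rel from any query or fragment.
    $cut = strcspn($rel, '?#');
    $relPath = substr($rel, 0, $cut);
    $suffix = substr($rel, $cut);

    if ($relPath[0] === '/') {
        $path = $relPath; // root-relative: replaces the whole base path
    } else {
        // Path-relative: replace everything after the last "/" of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }
    return $root . remove_dot_segments($path) . $suffix;
}

// Usage with a few hrefs from the output shown in the question.
foreach (['http://stackexchange.com', '/about', '?tab=hot'] as $href) {
    echo resolve_url('http://stackoverflow.com', $href), "\n";
}
// prints:
// http://stackexchange.com
// http://stackoverflow.com/about
// http://stackoverflow.com/?tab=hot
```

In the loop from the question, this would mean calling something like `resolve_url($target_url, $link->href)` instead of concatenating `$target_url` and the href. A root-relative link such as '/wiki/Web_search_engine' then replaces the path of the page URL rather than being appended to it.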
