Web crawlers are everywhere, how should they be regulated?

Posted by: Yoyokuo 2023-01-03

In the era of big data, besides collecting data directly from users, another major source of data is the use of web crawlers to gather public information. How widely are crawlers used? According to industry insiders, crawlers contribute more than 50% of all traffic on the Internet. For some popular web pages, crawler traffic may even account for more than 90% of total page traffic.

From a technical point of view, a crawler is a program that simulates the behavior of a human browsing web pages or apps and then grabs the information its author needs. As the data industry develops, the value of data keeps rising and the competition for data is becoming increasingly fierce. “Crawler” versus “anti-crawler” has become an endless contest of attack and defense. Some crawlers go against the wishes of a website, access it without authorization, and obtain large amounts of public or non-public data from it, which has led to many legal disputes.

On October 23, the Hangzhou Yangtze River Delta Big Data Research Institute, the Shanghai Yangpu District People’s Procuratorate, the Shanghai Enterprise Legal Counsel Association, the Zhejiang Enterprise Legal Counsel Association, and the Caijing Business Governance Research Institute jointly held the “Yangtze River Delta Data Compliance Forum: Seminar on Legal Regulation of Data Crawlers”. A number of leading legal scholars, judges, prosecutors, and Internet practitioners were invited to discuss the subject from different perspectives, including “Data Crawler Technology and Industrial Impact”, “Data Crawler’s Civil Law Responsibility”, and “Data Crawler Criminal Law Compliance”.


Crawlers are everywhere

“Crawlers have a wide range of application scenarios, both compliant and non-compliant. For example, they are used to capture review data from e-commerce websites for market research; digital content businesses use crawlers to collect relevant content from the Internet; some operators take data and, after optimizing it, launch a ‘paid version of the database’; Qichacha and Tianyancha also use crawler technology to put government data to commercial use,” said Liu Yu, head of digitalization at L’Oreal China.

Liu Yu explained the basic principle of a crawler. Usually, a crawler locates all the URL links of a website, fetches the data on each page, and then parses and uses that data. Whether on the web or on mobile, a basic crawler works on this principle. Crawling carries risks for both the crawler and the crawled party, ranging from a crashed website to a prison sentence.
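The principle Liu Yu describes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not any particular company's crawler: the class name `LinkAndTextParser`, the function `parse_page`, and the sample page are all hypothetical, and the page is an inline string so the example makes no network request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects absolute link URLs and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # URLs a crawler would visit next
        self.text_parts = []  # data found on the current page

    def handle_starttag(self, tag, attrs):
        # Step 1: locate the URL links on the page.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Step 2: obtain the data on the page.
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Step 3: break the page down into links and text for later use."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page content standing in for a fetched response.
SAMPLE_HTML = (
    "<html><body><h1>Reviews</h1>"
    '<a href="/item/1">Item 1</a>'
    '<a href="https://example.com/item/2">Item 2</a>'
    "</body></html>"
)

links, text = parse_page(SAMPLE_HTML, "https://example.com/")
print(links)
print(text)
```

A real crawler would repeat these steps for every link it discovers, which is exactly why the volume of requests can become a problem for the crawled site.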

Specifically, for small websites or websites with limited technical capacity, a crawler that keeps visiting around the clock may generate more traffic than the server can withstand, causing the website to crash. More troublesome still, programmers who write crawlers may be breaking the law if they crawl data they should not and then use it.
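One way a crawler author can reduce the load described above is to enforce a minimum pause between requests. The sketch below is an assumption-laden illustration rather than an established standard: the `Throttle` class and the 2-second interval are hypothetical, and the demonstration uses a simulated clock so that it runs instantly.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a simulated clock, so no real time passes.
fake_now = [0.0]
pauses = []


def fake_sleep(seconds):
    pauses.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()      # first request: no pause needed
fake_now[0] += 0.5   # pretend the request itself took 0.5 s
throttle.wait()      # second request: pauses for the remaining 1.5 s
print(pauses)
```

In real use the crawler would construct `Throttle(2.0)` with the default clock and call `wait()` before each fetch.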

Liu Yu said that attitudes towards crawlers differ greatly from one scenario to another. Search engine crawlers, for example, are welcome because search engines increase the exposure of the websites they crawl; but most websites do not want other crawlers to take their data, whether because of the load on their servers or for various business reasons. This has given rise to ‘anti-crawler’ mechanisms and, in response, ‘anti-anti-crawler’ mechanisms. Websites can adopt corresponding strategies or technical measures to prevent crawlers from crawling their data.

A common response is for a website to publish a Robots protocol file. The protocol was written by Dutch engineer Martijn Koster in 1994 and later became a common communication mechanism between data crawlers and the sites they crawl. In the “China Internet Industry Self-discipline Convention” issued by the Internet Society of China in 2012, compliance with the Robots protocol was recognized as one of the “internationally accepted industry practices and business rules”.

However, Liu Yu said that the Robots protocol is more like a gentleman’s agreement: it can only notify, not prevent. Crawler technology, anti-crawler technology, and anti-anti-crawler technology keep iterating. As long as a website or app can be accessed by users, it can be crawled.
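The “notification only” nature of the Robots protocol is visible in how a crawler consults it. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt content, the crawler name `DemoBot`, and the helper `may_fetch` are hypothetical. The point is that the check runs inside the crawler itself, so nothing on the website side forces a crawler to perform it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves this file at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit this crawler to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_public = may_fetch(ROBOTS_TXT, "DemoBot", "https://example.com/articles/1")
allowed_private = may_fetch(ROBOTS_TXT, "DemoBot", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

A compliant crawler skips any URL for which `may_fetch` returns `False`; a non-compliant one simply never asks.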

Improper crawling methods waste hard-won social and technical resources. Zeng Xiang, general counsel of Xiaohongshu, said that some crawlers collect data by “simulating real access” or by “cracking protocols”. “These are disreputable means, and the websites being crawled have to take countermeasures of their own, which wastes a great deal of corporate resources.”

Zeng Xiang said that for content platforms, crawler attacks can easily infringe on the intellectual property rights of both the platform and its users. Crawling is usually purposeful: if core business secrets are crawled, they can be used directly elsewhere to create a competitive advantage. In addition, in his view, crawlers also undermine the public order of the Internet. “Whether the crawled data can be used effectively, whether it is put under supervision, and where the data flows are all very big question marks.”


Determining the civil liability of crawlers

“Technology is neutral, but the application of technology is never neutral.” Zhang Zhe, litigation director of Sina Group, said that when discussing the principles of crawler technology, it is more important to look at what the technology is used for and whether the conduct itself is legitimate.

Recently, the Beijing Higher People’s Court (hereinafter the “Beijing Higher Court”) handed down its second-instance judgment in “Today’s Toutiao v. Weibo for Unfair Competition”. In this case, Weibo was sued for setting a blacklist in its Robots protocol to restrict ByteDance from crawling its web content. The court held that Weibo’s conduct was a legitimate exercise of an enterprise’s right to operate independently and did not constitute unfair competition, and it revoked the first-instance judgment. Zhang Zhe said that the judiciary’s evaluation of the Robots protocol has two sides.
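A blacklist of this kind works by giving one named crawler its own rule group in the Robots protocol file. The sketch below is purely illustrative and does not reproduce the file at issue in the case: the crawler names `ExampleNewsBot` and `OtherBot`, the rules, and the URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named crawler is shut out, everyone else is allowed.
BLACKLIST_ROBOTS_TXT = """\
User-agent: ExampleNewsBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(BLACKLIST_ROBOTS_TXT.splitlines())

url = "https://example.com/post/1"
blocked_bot_allowed = parser.can_fetch("ExampleNewsBot", url)
other_bot_allowed = parser.can_fetch("OtherBot", url)
print(blocked_bot_allowed, other_bot_allowed)
```

The same page is thus open to one visitor and closed to another purely on the basis of who is asking, which is the kind of identity-based distinction the courts have been asked to evaluate.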

In 2020, in its judgment in the “360 v. Baidu Unfair Competition Case”, the Beijing Higher Court held that Baidu should not, without reasonable and legitimate grounds, discriminate among search engines by identity when restricting which of them may crawl website content. In “Today’s Toutiao v. Weibo for Unfair Competition”, the court established the principle that an enterprise has the right to restrict other visitors within the scope of its own business, and that such a restriction may be found improper only when it violates public interests and consumer rights.

According to Gao Fuping, professor at the School of Law of East China University of Political Science and Law and director of its Data Law Research Center, crawlers and the data industry are bound up with each other. The data intelligence and big data analysis that so-called data companies talk about basically consist of capturing data and then mining and analyzing it. Crawling is now generally considered a neutral technology, but more often than not its users are out to get something for nothing.

Gao Fuping believes that it is difficult to judge the legitimacy of crawlers without first addressing the control that legitimate producers of data have over it. Discussion of the legal boundaries of crawling, both in China and abroad, focuses mainly on the means and the purpose of data crawling.

From the point of view of the means, a crawler that ignores a website’s access controls or pretends to be a legitimate visitor will be considered illegal. From the point of view of the purpose, the question is whether the crawling party uses the data to make a “substantial substitution” for some of the products or services provided by the crawled party; if it does, the purpose is unlawful.

If a website has lawfully accumulated data resources, the website operator can control how they are used. More importantly, it should be recognized that the data controller can open the data up for commercial purposes and, through licensing, exchange, and transactions, allow more people to use it. Gao Fuping added, “Based on the premise that the legitimate producers of data have control, it is possible to crack down on crawlers that ignore the Robots protocol.”

Xu Hongtao, a judge in the Intellectual Property Division of the Shanghai Pudong Court, believes that two issues need to be considered regarding the Robots protocol and data flow: first, how to strike the right degree of “interconnection” and data sharing; second, whether the Robots protocol strategies currently adopted by Internet operators could lead to data silos. The essence of interconnection is to ensure the orderly flow of data, not to force Internet operators to open all the data resources on their platforms to competitors. In the context of “interconnection”, “orderly” and “flow” are equally important and indispensable, and conduct that impedes fair competition or endangers user data security under the guise of “interconnection” must be excluded.

In a case in which a new media company crawled data from the WeChat public platform, the Hangzhou Internet Court made its position clear. By setting up the Robots protocol, a network platform expresses the hope that competitors will abide by competition norms, or at least maintain an arrangement of mutual respect and mutual compliance, which is the basis for order.

In the above-mentioned case, the court held that allowing third-party crawler tools to crawl public account information would discourage the platform’s investment and distort the mechanism of market competition for big data as a factor of production. From the perspective of consumer interests, crawling and displaying the information without authorization failed to respect the wishes of those who published it. From the perspective of public interests, the defendant did not mine the information in depth, innovate on it, or apply it at a deeper level after crawling it, and so failed to enhance the overall public interest. In addition, its crawling of the data was not justified.

Xu Hongtao believes that data is the core competitive resource of the content industry, and the data collected and analyzed by content platforms often has extremely high economic value. Requiring content platform operators to open their core competitive resources to competitors without limit would not only violate the spirit of “interconnection” but also harm the continued production of high-quality content and the sustainable development of the Internet industry.

Xu Hongtao said that judging the legitimacy of non-search-engine crawlers comes down to four elements: first, whether the crawler respects the Robots protocol set by the crawled website; second, whether it circumvents the technical measures of the crawled website; third, whether it threatens the security of user data; and fourth, how it measures up in terms of creativity and public interest.

Xu Hongtao specifically pointed out that user data, including identity data and behavior data, is not only a competitive resource for operators but also carries personal privacy attributes, so the collection of such data bears more heavily on social and public interests. If scraping compromises the security of user data, the conduct is not legitimate.


Crawlers and criminal compliance

Criminal compliance, which originated in the United States, refers to a set of supervision, restraint, and incentive mechanisms established by the state to promote corporate compliance management by using criminal law as a tool.

In 2020, at the urging of the Supreme People’s Procuratorate, grass-roots procuratorial organs in Shenzhen, Zhejiang, Jiangsu, Shanghai, and other places actively explored corporate criminal compliance. To encourage more companies to carry out compliance reforms, a new criminal procedure mechanism of “non-prosecution on compliance” has been rolled out across the country: prosecutors select companies involved in crimes that are likely to establish a compliance plan, and then decide not to prosecute them.

Wu Juping, deputy director of the Third Procuratorial Department of the Second Branch of the Shanghai People’s Procuratorate, said that criminal compliance is mainly intended to give the companies involved a chance to rectify their conduct and start over, and also to safeguard high-quality social and economic development. At present, the criminal compliance that many companies care about is mostly a matter of how to avoid criminal risk in their business conduct. Wu Juping believes that companies using crawler technology for data analysis should focus on how to put criminal compliance into practice.

Wu Juping said, “Apart from technologies that are unlawful in themselves, such as Trojan horse and virus programs, when we judge whether conduct involving crawling technology constitutes a crime, we first look at what the perpetrator did with the technology and whether it caused social harm. We then judge whether the conduct amounts to intruding into a computer information system or illegally obtaining computer information system data, and then look at whether the crawled data involves corporate data or citizens’ personal information, with the relevant crimes applying accordingly.”

It is also necessary to consider whether the legal nature of the crawled data is property or merely data. Wu Juping said that this is highly contested in judicial practice. “For example, we had a case in which the other party was forced, by means of illegal detention, to hand over virtual currency. In the criminal proceedings it was treated as the crime of illegal detention, which denied that virtual currency is property, while in the civil proceedings the court ordered the property returned, which recognized that it is.” She believes that data is an important factor of production in the development of the digital economy and should in essence be treated as property, but current law and judicial practice have not fully caught up.

Zhang Yong, a professor at East China University of Political Science and Law, classified the criminal conduct that crawlers may involve. From the point of view of the rights and interests that may be infringed, these include computer system security, personal information, copyright, state secrets, trade secrets, and the order of market competition. From the point of view of the crawling method, a crawler may endanger the security of computer information systems, illegally obtain citizens’ personal information, illegally obtain trade secrets, or defeat technical measures that protect copyright, among other categories.

Caijing E Law retrieved 54 criminal judgments related to crawlers from the Judgment Documents website, involving a range of crimes. Among them, 26 concerned the crime of infringing on citizens’ personal information; 10 the crime of illegally obtaining computer information system data; 5 the crime of disseminating obscene materials for profit; 3 the crime of destroying computer information systems; 3 the crime of providing programs and tools for intruding into and illegally controlling computer information systems; 3 the crime of infringing intellectual property rights; and 1 each the crimes of illegally intruding into computer information systems, opening a casino, theft, and fraud.
