- A. Overview
- B. Definitions and delimitations
- C. CC-licensed training material
- D. Text and data mining (Section 44b UrhG)
- E. Machine learning output
Literature: Tim Dornis, The Protection of Artificial Creativity in Intellectual Property Law, GRUR 2019, 1252; Patrick Ehinger/Lara Grünberg, The Protection of Products of Artificial Creativity in Copyright Law, K&R 2019, 232; Maximilian Herberger, "Artificial Intelligence" and Law, NJW 2018, 2825; Till Jaeger, Artificial Intelligence: The Battle for Copyright, heise online, 17 February 2023, https://perma.cc/U37N-B2HZ; Lisa Käde, Creative Machines and Copyright – The Machine Learning Creation Chain from Training to Model Protection to Computational Creativity, 2021; Felicitas Lea Kleinkopf, Text and Data Mining, 2022; Felix Krone, Copyright Protection of ChatGPT Texts?, RDi 2023, 117; Anne Lauber-Rönsberg, Autonomous "Creation" – Copyright and Protectability, GRUR 2019, 244; Niklas Maamar, Copyright Issues in the Use of Generative AI Systems, ZUM 2023, 481.
A. Overview
1 The topic of "artificial intelligence" does not stop at CC licences – especially because in most cases unimaginable amounts of data are required to develop reliable systems. Understandably, CC-licensed material is often used for this purpose. This chapter deals with the compatibility of CC licences and machine learning or text and data mining reservations and the applicability of CC licences to machine learning output.
B. Definitions and delimitations
I. Text and data mining
2 Since 2021, text and data mining has been defined in Section 44b(1) of the German Copyright Act (UrhG) as "the automated analysis of individual or multiple digital or digitised works in order to obtain information, in particular about patterns, trends and correlations." The methodology is not new, and the term refers to the automated, algorithm-driven analysis of large amounts of text or data in general.
II. Machine learning
3 Machine learning refers to the process of using automated optimisation procedures to allow computer programmes to independently determine the conclusion "if A, then B" instead of defining it in advance. When people talk about ChatGPT, Midjourney or similar systems as "artificial intelligence" today, they are usually referring to machine learning models.
4 A programme that is supposed to learn to distinguish images of planets from images of astronauts, for example, receives images of astronauts and planets with the corresponding labels ("astronaut"/"planet") as training data in "supervised learning", i.e. a monitored learning process. The programme therefore "knows" what the desired result should be and optimises its internal parameters during training so that it recognises the structures that the images of each category have in common. As a result, the programme can make statements about unknown, new images, indicating the probability that a planet or an astronaut is recognisable in an image.
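The supervised learning process described above can be illustrated with a minimal sketch. It assumes that each image has already been reduced to two made-up numerical features and uses a simple logistic regression instead of a neural network; the feature values, the function names and the training settings are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy training data: each image is reduced to two made-up
# features (share of dark pixels, roundness of the main shape).
# Label 1 = "planet", label 0 = "astronaut".
TRAINING_DATA = [
    ((0.90, 0.95), 1),
    ((0.80, 0.90), 1),
    ((0.85, 0.80), 1),
    ((0.30, 0.20), 0),
    ((0.40, 0.10), 0),
    ((0.20, 0.30), 0),
]


def predict(weights, bias, features):
    """Return the estimated probability that the image shows a planet."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=2000, learning_rate=0.5):
    """Optimise the internal parameters (weights, bias) by gradient descent."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            # The label tells the programme what the desired result is.
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(TRAINING_DATA)

# A new, previously unseen image (again reduced to two features).
p_planet = predict(weights, bias, (0.75, 0.85))
print(f"planet: {p_planet:.2f}, astronaut: {1 - p_planet:.2f}")
```

After training, only the optimised numbers in `weights` and `bias` are needed to assess new images.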
5 Today, artificial neural networks or decision tree-based approaches are often chosen for machine learning. In contrast to classic computer programmes, in which each variable is predefined by humans, the core of these machine learning systems consists of so-called models that process training data and (in many cases) produce output.
1. Model
6 The heart of machine learning systems usually consists of a large number (sometimes trillions) of parameters, i.e. numerical values, as well as structural information (hyperparameters) that defines the arrangement of the parameters and specifies calculation methods. This core is usually integrated into a "classic" computer programme within the meaning of Section 69a of the German Copyright Act (UrhG). It is also possible to combine different models with different capabilities.
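The distinction between parameters and hyperparameters can be made tangible with a small sketch. It assumes a tiny fully connected network; the layer sizes, the function names and the choice of calculation method (weighted sums followed by a ReLU) are hypothetical and far simpler than in real systems.

```python
import random

# Hyperparameters: structural information that is fixed before training.
# The layer sizes are made up for illustration.
LAYER_SIZES = [4, 3, 2]  # 4 inputs, one hidden layer of 3 units, 2 outputs


def init_parameters(layer_sizes, seed=0):
    """Create the parameters (weights and biases) for every layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def count_parameters(layers):
    """Count all numerical values that training would optimise."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def forward(layers, inputs):
    """The calculation method: weighted sums followed by a ReLU."""
    values = inputs
    for weights, biases in layers:
        values = [max(0.0, sum(w * v for w, v in zip(row, values)) + b)
                  for row, b in zip(weights, biases)]
    return values


model = init_parameters(LAYER_SIZES)
print(count_parameters(model))  # 4*3 + 3 + 3*2 + 2 = 23
print(forward(model, [1.0, 0.5, 0.0, 0.25]))
```

The model in this sketch is nothing more than nested lists of numbers; the surrounding functions correspond to the "classic" computer programme into which the model is integrated.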
2. Training data
7 Training data refers to the data (of any kind) used to "train" the models. It may itself be protected by copyright (e.g. as photos or as a database in accordance with Section 87a UrhG) or be simple data that is not subject to any property rights. In this context, "training" refers to the internal optimisation process, i.e. the optimisation of the trillions of parameters that determine the path from input to output. Training data, which can be in the form of images or text, is first converted into a uniform form for training and then converted by a computer programme into a format that can be processed by the model. The training data is then no longer recognisable as, for example, an image, but exists for the model only as a collection of numbers (vector), which does not necessarily make its way through the model in a coherent manner. The numerical form of the training data is used by the model to optimise its parameters, but is not otherwise stored in the model. Accordingly, models do not contain any reproductions of training data.
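The conversion of training data into a collection of numbers can be sketched as follows. The example assumes a made-up 2x2 greyscale image and a made-up five-word vocabulary; real systems use considerably more elaborate preprocessing, and the function names are hypothetical.

```python
# A hypothetical 2x2 greyscale "image": pixel brightness from 0 to 255.
image = [
    [0, 128],
    [255, 64],
]

# A hypothetical fixed vocabulary that maps words to numbers.
vocabulary = {"the": 0, "astronaut": 1, "sees": 2, "a": 3, "planet": 4}


def image_to_vector(pixels):
    """Flatten the pixel grid and scale every value to the range 0..1."""
    return [value / 255 for row in pixels for value in row]


def text_to_vector(text, vocab):
    """Replace every known word by its index in the vocabulary."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


print(image_to_vector(image))
print(text_to_vector("The astronaut sees a planet", vocabulary))  # [0, 1, 2, 3, 4]
```

In this numerical form, the data can be fed to the optimisation process; the sketch makes no statement about what a trained model retains of it.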
3. Input
8 In the context of machine learning models, "input" usually refers to information that users provide to the model (e.g. images, but nowadays also text prompts, for example in ChatGPT). Conceptually, a distinction is not always made between training data and input. A clear distinction must be made here, however, because the associated actors (e.g. developers and users) exert very different influences on the models owing to their different roles and the different points in time at which they act.
4. Output
9 The models differ significantly in terms of the results they produce. Many models only output probability values (so-called predictions), especially in image recognition. This is the case, for example, with industrial cameras used in the production process to identify defects. Generative machine learning models such as ChatGPT, Stable Diffusion or Midjourney, by contrast, generate data (images, text, music, computer programme code, etc.). These products are simply referred to as "output".
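How a model arrives at probability values can be sketched with the softmax function, a common way of turning raw model scores into probabilities that sum to 1. The labels and scores below are made up and loosely follow the industrial camera example; they are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and raw model scores for one camera image.
LABELS = ["ok", "scratch", "dent"]
raw_scores = [2.0, 0.5, -1.0]

probabilities = softmax(raw_scores)
prediction = dict(zip(LABELS, probabilities))
best = max(prediction, key=prediction.get)

for label, p in prediction.items():
    print(f"{label}: {p:.2f}")
print("most likely:", best)
```

A predictive model of this kind outputs only such values and does not generate new images or text.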
III. Artificial intelligence
10 The term artificial intelligence (AI) can be understood as an umbrella term for various technologies, including expert systems, robotics and machine learning. So when ChatGPT is referred to as "AI", what is actually meant is "machine learning". The lack of clarity has become so established that, for the sake of simplicity, the term "AI" is used in many places here as well.
C. CC-licensed training material
I. Do CC licences allow machine learning?
11 CC licences are generally open as to the purpose of use; the only restriction on fields of application is the limitation to non-commercial use in the NC licence types. The question can therefore be answered as follows: CC licences do not explicitly prohibit machine learning. Materials that have been published under CC licences can generally be used for machine learning.
II. Do CC licence obligations have to be fulfilled during training?
12 First of all, training usually only involves internal reproductions, for which CC licences do not impose any licence obligations anyway. Obligations, such as naming the authors of the training data, may therefore only have to be fulfilled when the trained models are distributed. This would require that the training data still be contained in the trained model, such as the trained neural network. This is precisely the view taken by the plaintiffs in the lawsuit pending in the United States brought by the artists Andersen, McKernan and Ortiz against Stability AI and Midjourney. As explained in margin note 7, however, models do not as a rule contain reproductions of the training data.
13 Accordingly, the better arguments suggest that the licence obligations of CC licences do not have to be fulfilled in the context of training and the subsequent distribution of the trained model, provided that this data is not reproducibly contained in the model.
D. Text and data mining (Section 44b UrhG)
I. Scope of application of the limitation
14 The limitation in Section 44b UrhG has been in force since the amendment of the German UrhG in 2021 and implements Article 4 of the DSM Directive. Its first paragraph defines text and data mining:
"Text and data mining is the automated analysis of individual or multiple digital or digitised works in order to obtain information, in particular about patterns, trends and correlations."
15 Not least from the explanatory memorandum to the Act
16 The second paragraph of Section 44b UrhG contains the permission relevant to model training:
"Reproductions of lawfully accessible works for text and data mining are permitted. The reproductions must be deleted when they are no longer required for text and data mining."
17 The phrase "for text and data mining" is to be understood broadly; it encompasses not only reproductions that occur during the training process (e.g. when reading in data), but also reproductions in the run-up to this, e.g. during data collection. This broad interpretation also follows from the principle "the right to read is the right to mine".
18 Accordingly, it is permissible to reproduce copyright-protected works found on the internet or elsewhere for the purpose of text and data mining in the context of machine learning without obtaining permission. The only conditions for this are that there is "lawful access" to the work and that no machine-readable reservation is attached (this is, in principle, a possibility under paragraph 3 of the provision to prevent the use of one's own works for text and data mining – however, some special features apply to CC-licensed works in this regard, see margin note 20 ff. and margin note 23 ff.).
19 Access is lawful if the work is freely available on the internet (and has not been made available there in an obviously illegal manner, such as on video piracy platforms) or if licensed access to the work is available, either by means of a subscription or under a CC licence.
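How a data collection programme might check for a machine-readable reservation before reproducing a work can be sketched as follows. The sketch assumes that the reservation is expressed in a website's robots.txt file; whether this form satisfies the requirements of Section 44b(3) UrhG is a legal question that the sketch does not answer. The robots.txt content, the crawler name "ExampleTrainingBot" and the function name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content of a website.
# "ExampleTrainingBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt, user_agent, url):
    """Return True if the robots.txt rules allow this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.org/photos/1.jpg"
print(may_collect(ROBOTS_TXT, "ExampleTrainingBot", url))  # False
print(may_collect(ROBOTS_TXT, "SomeOtherBot", url))        # True
```

The check only reads the rules as published; it says nothing about whether access to the work is lawful.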
II. Can data mining be reserved for CC works?
20 Section 44b (3) UrhG allows authors to prevent the use of their own works for text and data mining by attaching a machine-readable reservation notice. The question arises as to whether such a reservation is also possible when using CC licences.
21 Such a reservation restricts the use of the licensed work. There is no special CC licence analogous to CC BY-ND or CC BY-NC, such as CC BY-NT ("No Text and Data Mining" / "No Training"). However, a reservation attached separately to the licence in machine-readable form would de facto restrict the licensed rights.
22 In principle, authors are of course free to attach such a reservation in machine-readable form. However, designating the work as CC-licensed may be misleading if such a reservation exists. The consequence would be that the use of CC logos, CC buttons and trademarks in connection with the work would be inadmissible, and the work could not be designated as "CC-licensed" or similar (see section 7 and Annex margin note 3 ff.).
III. The text and data mining exception vs. CC-*-NC
23 However, it may be possible to achieve a de facto reservation by the means available under CC licences: With the increasing use of machine learning in the commercial environment, the question arises as to whether material published under an NC licence may be used for commercial model training on the basis of the limitation in Section 44b UrhG or whether this is precluded by the reservation of non-commercial use.
24 NC licences prohibit the use of licensed material primarily for the purpose of obtaining commercial advantage or monetary remuneration (see section 1.i margin note 73). However, a glance at Section 2.a.2 makes it clear that even the NC licence does not fundamentally prevent the application of the text and data mining limitation:
"Exceptions and limitations. It should be clarified that wherever legal exceptions and limitations apply to your use, the present Public Licence does not apply and you are not required to comply with its terms in this respect."
25 Nevertheless, Section 44b(3) UrhG allows machine-readable reservations to be attached. This raises the question of whether the NC element should be interpreted as a sector-specific reservation against text and data mining with the correspondingly licensed works. However, the above-cited Section 2.a.2 ("Exceptions and Limitations") shows that CC licences are not intended to restrict statutory exceptions. Accordingly, NC licences are not to be interpreted as a reservation against use for text and data mining.
IV. Advantages and disadvantages of machine learning under CC and Section 44b UrhG
26 When selecting training data, the question may arise as to whether CC-licensed works are preferable to non-CC-licensed works for which the limitation of Section 44b UrhG is invoked. The decisive factor here is which obligations or restrictions apply in each case.
27 For example, Section 44b (2) UrhG requires that the reproductions made or created be deleted if they are no longer required. This raises the question: When does this point in time occur? When an ML model has been fully trained for the first time? Or are longer periods of time conceivable here in order to save work results and be able to reproduce the training later? Is the necessity linked to the training of a model, or does it refer – for example, in a data science company – to all text and data mining?
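One conceivable technical handling of the deletion requirement is sketched below. It assumes, purely for illustration, that the reproductions are no longer required once a single analysis run has finished, which is only one of the readings left open by the questions above. The function name and the sample data are hypothetical, and the "analysis" is a mere placeholder.

```python
import tempfile
from pathlib import Path


def analyse_with_temporary_copies(sources):
    """Copy works into a temporary directory, analyse them, then delete the copies."""
    with tempfile.TemporaryDirectory() as workdir:
        copies = []
        for name, content in sources.items():
            path = Path(workdir) / name
            path.write_bytes(content)  # the reproduction
            copies.append(path)
        # Placeholder for the actual analysis: only the total size is computed.
        result = sum(len(p.read_bytes()) for p in copies)
    # Leaving the "with" block removes the directory and all copies in it.
    still_present = [p.exists() for p in copies]
    return result, still_present


total, still_present = analyse_with_temporary_copies(
    {"a.txt": b"abc", "b.txt": b"de"}
)
print(total, still_present)  # 5 [False, False]
```

If the copies are to be kept longer, for example to reproduce a training run later, the deletion step would have to be tied to a different point in time.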
28 Whether there is a reservation against the use of ML under Section 44b(3) UrhG must be examined, particularly in the case of works that are not CC-licensed. Such a reservation may also be appropriate in certain circumstances for CC-licensed works; for the effect, see margin note 20 ff.
E. Machine learning output
I. Are CC licences for training material relevant for AI output?
29 In principle, CC licences for training material are not relevant for the use of the output if the output – as is usually the case – does not contain any reproduction of the training material. If this is nevertheless the case, for example when using corresponding prompts and parameters,
II. Are CC licences for machine learning models relevant to AI output?
30 The question aims to determine whether the licensing of the machine learning model has an impact on the copyright protection of the model's products. As a rule, this is unlikely to be the case: licensing always refers only to the specific protected object and possibly its derivatives. However, the products of ML models are not derivatives thereof (in the sense of modified/extended versions of the original work), but merely products of the system. As with other computer programmes (cf. e.g. the licensing of Open Office and a text document produced with it), the machine learning model and its output must be considered separately.
III. Can machine learning output be licensed under CC?
31 Whether CC licences can be used for ML output depends on whether there is a protectable work that can be the subject of a licence in the first place. Output that is not protectable cannot be licensed with effect vis-à-vis everyone. The question of protectability is currently the subject of intense debate in legal scholarship;
32 With increasing influence – for example, through highly optimised prompts or iterative processing of a generated image – the likelihood that users are entitled to copyrights on the ML output increases.
33 This gives rise to various situations that must be assessed separately:
1. Application of a CC licence to non-copyrighted ML output
2. Application of a CC licence to copyright-protected ML output
3. Application of a CC licence to edited ML output
34 In case 1, the use of a CC licence is not recommended because where no copyright exists, no rights can be granted by means of a licence. In the interests of legal certainty, it would be advisable to attach a public domain dedication, if possible. Conversely, the licence conditions of a CC licence for ML output must not simply be ignored. On the one hand, it is difficult to tell from the output whether creative input from a human being has been incorporated into it; on the other hand, it is also conceivable that the user is contractually bound by the CC licence (see below margin note 38 ff.).
35 In case 2, there is no objection to the use of a CC licence, provided that the users can be classified as authors.
36 Case 3 is likely to be fairly common: a product is created using an ML tool and edited for further use, for example by applying filters, expanding the content, changing colour spaces, adding or removing elements from an image, etc. In this case, it depends on whether the processing can be considered a personal intellectual creation; if so, it is possible to use a CC licence for the work created through processing. However, this does not affect the underlying public domain AI product, which can still be used, modified and distributed by anyone as they wish.
37 In most cases, one challenge is likely to be provability: as long as the use of AI tools potentially results in public domain status, there is no incentive to disclose that AI has been used. Proving this in individual cases in order to establish the presumption of copyright (Section 10 UrhG)
IV. Effects of contractual agreements, for example in the context of terms of use
38 It is not uncommon for providers of AI models to deal with the unclear situation regarding output authorship by addressing the "distribution" of copyright in their terms of use. Midjourney, for example, grants users who have free access to the system copyrights to the output only in the form of a CC BY-NC licence, while paying users are declared to be the authors of the output they create.
39 Of course, the creation of original copyrights cannot be the subject of contractual agreements. In this respect, it depends solely on the actual situation and whether a personal intellectual creation exists.
40 However, it is conceivable, especially under the application of a foreign legal system, that the terms of use of an ML tool could lead to a contractual obligation. In this case, the user would be obliged to comply with the "licence terms" of the referenced CC licence, even though no protectable licence object exists. However, such a contractual agreement has no third-party effect. Anyone who receives such output without being a contractual partner of the ML tool provider may reuse it without restrictions.
Open Access Kommentar, Commentary on J. TDM, machine learning and AI is licensed under a Creative Commons Attribution 4.0 International License.