Training on open source code: what GPL, MIT and other licenses actually say about AI
🧠 If my model trains on GitHub, am I now “infected” by GPL?
That’s the question everyone from solo devs to Big Tech GCs has been wrestling with since Copilot and its fellow code copilots arrived.
This guide walks through how the main open-source licenses (MIT/BSD, Apache 2.0, GPL/LGPL, MPL, AGPL) intersect with AI training – what the licenses actually say, how courts and community bodies (OSI, FSF, etc.) are interpreting them, and where the real risk lives today.
🧩 What “training on open source code” means in legal terms
When lawyers and engineers talk past each other, it’s usually because they’re pointing at different parts of the AI stack.
| 🔧 Layer | What’s actually happening | Why lawyers care |
|---|---|---|
| Training data (code corpus) | You copy massive amounts of source code into a dataset | Copying & storage of copyrighted works; license terms on that code apply here |
| Training process | You run training code over that corpus to produce model weights | Uses the copied code but usually stays internal; licenses rarely regulate internal use |
| Model (weights + architecture) | A big matrix of parameters that statistically encode patterns in the code | Debate: is this a “derivative work” or just statistics? No court has squarely said “a trained model is a derivative of GPL code.” |
| Outputs (generated code) | Snippets and files produced for users | If outputs substantially reproduce licensed code, downstream users may have to comply with those licenses |
| Downstream product | Your SaaS, IDE plugin, or closed-source app | This is where GPL/AGPL, attribution and share-alike duties may bite if outputs or integrated code are licensed |
Three separate legal regimes are in play:
- Copyright (is training or output an infringing “copy” or “derivative work”?)
- License / contract (did you agree to conditions on how you can copy/use that code, even if copyright might allow more?)
- Community definitions (OSI’s Open Source AI Definition, FSF’s “free ML” criteria), which shape expectations but don’t themselves create liability
⚖️ Big-picture: what the law actually says so far
A few points are reasonably clear, and a few are very much not:
- Courts have started to say AI training can be copyright infringement when it uses proprietary content to compete with the rightsholder (e.g. Thomson Reuters v. Ross – legal research headnotes).
- In the GitHub Copilot litigation, most claims were dismissed, but open-source license and DMCA “removal of copyright management information” claims survived, meaning a U.S. court is willing to treat license-based theories around training and regurgitation seriously.
- A detailed 2025 survey of the “GPL propagates to models” theory notes: no court has yet held that a trained model itself must be GPL because it was trained on GPL code, and mainstream community actors (OSI, FSF, SFC) are cautious about pushing that theory into precedent. (Open Source Guy)
- The Open Source AI Definition 1.0 (OSI) requires open code, model parameters and detailed information about training data for an AI system to be called “open source,” but it does not require releasing all training data itself – focusing on transparency and reproducibility instead. (Open Source Initiative)
- FSF, in contrast, is working on criteria under which an ML application counts as “free” only if its training data and scripts are themselves free – but that’s an ethical/definitional stance, not an interpretation that current GPL text already covers models.
The net: training on open source code is not automatically illegal, but it is also not a free-for-all. Risk is concentrated around:
- Ignoring license conditions (attribution, notices, copyleft)
- Shipping models that memorise and reproduce licensed code
- Downstream users pasting those outputs into closed-source products.
🏷️ What the major license families actually require
None of the classic FOSS licenses mention “AI” or “training.” They regulate copying, modifying, and distributing software and derivative works; AI training has to be shoehorned into those concepts.
🔍 Quick comparison of license families
| License family | Typical obligations | Where AI training intersects |
|---|---|---|
| MIT / BSD (permissive) | Keep copyright & license notice when you redistribute code or substantial portions; otherwise broad freedom of use | Training internally on MIT/BSD code is generally within the license grant. Risk appears if your model reproduces recognisable chunks and users ship them without required notices. (Nordia Law) |
| Apache 2.0 (permissive + patents) | Keep license & NOTICE file on distribution; grant / receive patent license; some conditions around patent suits | Similar to MIT/BSD for training; but if outputs or tools embed Apache-licensed code, you must preserve required notices. Patent grant rarely matters for pure training, more so when models or tools embody patented techniques from the original project. |
| GPLv2/v3 (strong copyleft) | If you distribute a program that is a derivative of GPL code or “contains” it, you must license the whole work under GPL and provide source; internal use is unrestricted | Training on GPL code without distributing the training corpus is likely permitted under the license text. The open question is whether the model or its outputs are “derivative works” or “works containing the Program.” No court has said “yes” yet, and major community analyses are skeptical that models fit easily into that definition. (Open Source Guy) |
| LGPL (weak copyleft) | Copyleft mainly for modifications to the library itself; dynamic linking from proprietary apps allowed; source obligations targeted at the library | Training on LGPL code is similar to GPL at the training phase but, again, it’s unclear how a model could be said to “contain” a library in the LGPL sense. Practical risk is mostly around verbatim output of library code. |
| MPL 2.0 (file-level copyleft) | Only files you modify or create based on MPL-covered files must remain MPL; you may combine them with proprietary code | Training on MPL code doesn’t trigger obvious MPL duties on the model. But if an AI assistant regurgitates an MPL-licensed file or substantial portion, and a developer ships it, those specific files must remain MPL-licensed and source-available. |
| AGPL (network copyleft) | Like GPL but extends to software offered over a network (SaaS); if users interact with the AGPL software over a network, they’re entitled to source | AGPL is most dangerous for using AGPL code directly in your service, not for training per se. That said, if a model-powered SaaS embeds AGPL snippets from outputs into server-side code, AGPL’s network copyleft can be triggered. |
A key point: most open-source licenses focus on distribution, not pure internal use. Training is mostly an internal act. The legal heat arrives when:
- You distribute the model or tools in a way that arguably makes them derivatives of licensed code, or
- Users ship generated code that is substantially similar to licensed works without complying with those licenses. (Nordia Law)
🧪 Are models trained on GPL code automatically GPL?
Short answer: today, no one can honestly say “yes” as a matter of settled law.
A 2025 deep dive on “GPL propagation to AI models” summarizes the current state like this: (Open Source Guy)
- Copyright theory: Most courts and scholars that have examined model training (e.g. in image and music cases) are hesitant to treat the model itself as a reproduction or derivative of the training works, except in extreme cases where it is engineered to spit out specific works verbatim at high frequency.
- GPL text: GPL is written around human-readable source and programs that contain or link GPL code. It doesn’t clearly cover statistical parameter matrices that may only encode tiny traces of GPL code among billions of weights.
- “Preferred form for modification”: If you insisted a model is a GPL derivative, what is the “source”? The weights are not human-modifiable in a meaningful sense; the training data is not the “source” of the model in the GPL sense either. The license text simply wasn’t drafted with models in mind.
- Community bodies:
- OSI’s Open Source AI Definition requires disclosing model code, parameters and detailed data information, but does not require publishing all training data, and does not say that GPL necessarily “propagates” to models. (Open Source Initiative)
- FSF is working on a new statement for free ML applications under which the training data and scripts themselves must be free before an ML app can be called “free,” but that is a new criterion, not a reinterpretation of the current GPL. (Free Software Foundation)
So at the moment:
- Risk of “your model is now GPL” is real in advocacy, theoretical in doctrine, and untested in court.
- What is very real is the risk that outputs containing GPL code pull downstream users into GPL obligations on their own software.
From a practical risk perspective, many serious AI teams treat GPL/AGPL-licensed code in training data as high-friction and take one of three paths (a corpus-partitioning sketch follows this list):
- Exclude it outright from training corpora, or
- Keep it in separate, trackable buckets, or
- Accept it only for models they’re comfortable releasing under strong copyleft–style terms. (Astraea Counsel)
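To make the “separate, trackable buckets” option concrete, here is a minimal corpus-partitioning sketch in Python. It assumes an upstream license scanner has already mapped each file to an SPDX identifier; the bucket names, the `SPDX_TO_BUCKET` table and the `SourceFile` record are illustrative, not any particular pipeline’s API.

```python
from dataclasses import dataclass
from enum import Enum

class Bucket(Enum):
    PERMISSIVE = "permissive"            # MIT, BSD, Apache-2.0, ...
    WEAK_COPYLEFT = "weak_copyleft"      # LGPL, MPL-2.0
    STRONG_COPYLEFT = "strong_copyleft"  # GPL, AGPL
    UNKNOWN = "unknown"                  # no detected license

# Illustrative mapping from SPDX identifiers (as reported by the
# upstream scanner) to risk buckets; extend to cover your real corpus.
SPDX_TO_BUCKET = {
    "MIT": Bucket.PERMISSIVE,
    "BSD-3-Clause": Bucket.PERMISSIVE,
    "Apache-2.0": Bucket.PERMISSIVE,
    "LGPL-3.0-only": Bucket.WEAK_COPYLEFT,
    "MPL-2.0": Bucket.WEAK_COPYLEFT,
    "GPL-3.0-only": Bucket.STRONG_COPYLEFT,
    "AGPL-3.0-only": Bucket.STRONG_COPYLEFT,
}

@dataclass
class SourceFile:
    repo: str
    path: str
    spdx_id: str | None  # None when the scanner found no license

def bucket_for(f: SourceFile) -> Bucket:
    """Route a file into a risk bucket based on its detected license."""
    if f.spdx_id is None:
        return Bucket.UNKNOWN
    return SPDX_TO_BUCKET.get(f.spdx_id, Bucket.UNKNOWN)

def partition_corpus(files: list[SourceFile]) -> dict[Bucket, list[SourceFile]]:
    """Split the corpus so copyleft and unknown-license code never
    silently mixes into a permissive-only training set."""
    corpus: dict[Bucket, list[SourceFile]] = {b: [] for b in Bucket}
    for f in files:
        corpus[bucket_for(f)].append(f)
    return corpus
```

The `UNKNOWN` bucket is deliberately defensive: code with no detectable license defaults to all-rights-reserved under copyright law, so it belongs with the high-friction material, not in the permissive set.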
💥 Case study: GitHub Copilot and open-source code
The Copilot class action is the main testbed for these issues. Developers sued Microsoft, GitHub and OpenAI, alleging that: (Nordia Law)
- Copilot was trained on massive amounts of GitHub code licensed under MIT, GPL, Apache and others.
- The system sometimes emits code that is nearly identical to open-source repositories, but without attribution or license notices.
- Training and output therefore violate license terms requiring attribution, copyright notices, and (for copyleft licenses) share-alike obligations.
- Copilot at times strips or omits copyright management information (e.g. headers with author names), which plaintiffs argue violates the DMCA’s §1202 prohibitions.
The federal court in California:
- Dismissed many broad and speculative claims, especially those not tied to specific works.
- But allowed two key claims to proceed: open-source license breach and DMCA §1202 “removal of copyright management information.” (Nordia Law)
That tells us:
- Courts are prepared to treat open-source licenses as enforceable contracts in the AI training context.
- The biggest exposure, for now, lies not in abstract “training is infringement” theories, but in concrete regurgitation of licensed code without complying with license conditions.
📊 Risk map: training vs outputs vs products
Here’s a simplified risk matrix for training on open-source code:
| Scenario | Example | Relative legal risk (today) | Why |
|---|---|---|---|
| Internal training on largely permissive code (MIT/BSD/Apache), no external model distribution | Firm trains an internal coding assistant on curated GitHub repos and uses it only inside the org | Low → Moderate | Licenses broadly allow use and modification; no distribution of code or model. Risk increases if outputs are copied verbatim into shipped products without attribution. (Astraea Counsel) |
| Training a public model on mixed code including GPL/AGPL, with no control on memorisation | Start-up releases weights of a model trained on “all of GitHub” | Moderate → High (license & PR) | No case has forced a model to go GPL, but plaintiffs can plausibly allege license breach and DMCA issues if the model produces identifiable GPL snippets. Community backlash is almost guaranteed. (Open Source Guy) |
| AI coding assistant used to generate snippets inserted into closed-source products | Dev pastes a 30-line function suggested by a Copilot-like tool into proprietary app | Moderate → High (downstream) | If that snippet is protectable expression copied from GPL/MIT/Apache code, the developer may have to comply with that license (GPL share-alike, MIT/Apache attribution, etc.). This is independent of whether training was lawful. (Nordia Law) |
| Model tuned on proprietary code that competes with rightsholder’s product | Training a legal-research AI on Westlaw headnotes or similar | High (copyright) | Thomson Reuters v. Ross suggests courts are willing to find AI training outright infringing where the AI product serves the same market and leans on proprietary content. (Astraea Counsel) |
| Open-source AI model released under a clear OSS license with curated, documented training data | Model, code and training data info released under consistent OSI-approved licenses | Lower but not zero | Aligns best with OSI’s Open Source AI Definition. Remaining risk is mainly around inclusion of third-party code whose licenses were misapplied or misunderstood. (Open Source Initiative) |
🧾 What MIT, GPL, Apache “actually say” for AI builders
Putting it concretely for the three licenses most people worry about:
✅ MIT / BSD: permissive, but not “no strings”
- They grant very broad rights to use, copy, modify, and merge the code for any purpose, including commercial.
- The main condition is that if you redistribute the code or substantial portions, you must include the copyright notice and license text. (Nordia Law)
- Training internally on MIT/BSD code is squarely inside those grants, and there’s no textual ban on machine learning.
Where problems arise:
- If your model memorises and outputs an MIT-licensed file or recognisable chunk, and someone ships it in a product without notices, that redistribution is non-compliant — even though training itself might be OK.
- Enterprises are therefore building license-aware filters and attribution systems so they can show provenance for code suggestions where possible (a minimal filter sketch follows below). (Astraea Counsel)
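A common building block for such filters is an n-gram fingerprint index over the licensed corpus: suggestions whose fingerprints overlap the index above some threshold get flagged for provenance review before a dev ships them. Here is a minimal sketch, assuming whitespace tokenisation is good enough for a first pass (production systems normalise identifiers, comments and formatting first); the function names and the 0.5 threshold are assumptions for illustration.

```python
import hashlib

def fingerprints(code: str, n: int = 12) -> set[str]:
    """Hash every n-token window of the code (a crude shingling scheme)."""
    tokens = code.split()
    return {
        hashlib.sha256(" ".join(tokens[i:i + n]).encode()).hexdigest()
        for i in range(max(len(tokens) - n + 1, 1))
    }

def build_index(licensed_files: dict[str, str]) -> dict[str, str]:
    """Built offline: map each fingerprint to the licensed file it came from."""
    index: dict[str, str] = {}
    for path, code in licensed_files.items():
        for fp in fingerprints(code):
            index[fp] = path
    return index

def flag_suggestion(suggestion: str, index: dict[str, str],
                    threshold: float = 0.5) -> list[str]:
    """Return the licensed files whose content overlaps the suggestion
    enough to warrant a license/attribution warning in the IDE."""
    fps = fingerprints(suggestion)
    hits = [index[fp] for fp in fps if fp in index]
    if hits and len(hits) / len(fps) >= threshold:
        return sorted(set(hits))
    return []
```

Real systems also tune `n`: too small and you flag idioms every programmer writes; too large and you miss near-verbatim copies with renamed variables.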
🧷 Apache 2.0: permissive + patent layer
- Similar to MIT in allowing broad use, but adds:
- A patent license/grant, and
- Conditions about preserving a NOTICE file and certain attributions on redistribution. (Astraea Counsel)
- For training, the big legal question is less about patents and more about not losing attribution and notices when code is regurgitated as suggestions.
In practice:
- Training on Apache 2.0 code is generally seen as acceptable within the license grant, especially for internal models.
- If your assistant suggests an Apache-licensed function and a dev ships it, you may need mechanisms to:
- Detect that it came from Apache-licensed code, and
- Help the dev preserve NOTICE and license text where appropriate (see the sketch below).
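One way to implement that second point is to attach license metadata to every flagged suggestion so the notices travel with the code. A minimal sketch; the `Attribution` and `Suggestion` records are hypothetical, not any real assistant’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Attribution:
    source_repo: str
    spdx_id: str      # e.g. "Apache-2.0"
    notice_text: str  # contents of the upstream NOTICE file, if any

@dataclass
class Suggestion:
    code: str
    attributions: list[Attribution] = field(default_factory=list)

def render_with_headers(s: Suggestion) -> str:
    """Prepend comment headers so that if a dev ships the snippet,
    the attribution and NOTICE material travels with it."""
    headers = []
    for a in s.attributions:
        notice = a.notice_text.replace("\n", "\n# ")
        headers.append(
            f"# Portions derived from {a.source_repo} ({a.spdx_id}).\n"
            f"# NOTICE:\n# {notice}"
        )
    return "\n".join(headers + [s.code])

# Usage: a suggestion traced back to a hypothetical Apache-2.0 repo.
s = Suggestion(
    code="def retry(fn, attempts=3): ...",
    attributions=[Attribution("github.com/example/httpretry", "Apache-2.0",
                              "Copyright 2021 Example Authors")],
)
print(render_with_headers(s))
```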
🧨 GPL / AGPL: copyleft, but not clearly “model-infectious”
What the text clearly does:
- Gives you permission to copy, modify and run GPL code internally, for any purpose.
- Says that if you distribute a program that is a derivative work of GPL code or that “contains” it, you must license the whole work under GPL and provide source.
- AGPL extends this to software offered over a network, not just distributed binaries. (Open Source Guy)
What is not clear:
- Whether a trained model “contains” the GPL program in the textual sense, or is a derivative work.
- Whether distributing model weights counts as distributing a derivative work of the training code.
- What the “preferred form for modification” is for a model (weights? training data? training code?). (Open Source Guy)
Community and scholarship right now largely agree on two operational points:
- Claiming “this model is GPL because it saw GPL code” is, at best, a long shot and, at worst, counter-productive to the open-source ecosystem. (Open Source Guy)
- Using AI tools to copy GPL snippets into closed products is a very real compliance risk, even if no one ever proves license propagation to the model itself.
For cautious teams, that usually translates to either:
- Avoid GPL/AGPL code in training entirely; or
- Restrict it to clearly marked projects where the resulting model and tools will be released under fully open terms compatible with copyleft. (Astraea Counsel)
🧭 Practical guardrails for teams training on open source code
Instead of abstract “don’t do that” rules, here’s a more operational, table-style playbook:
| Situation | Sensible guardrail |
|---|---|
| Building a general-purpose coding assistant | Prefer permissive-only corpora (MIT/BSD/Apache). If GPL/AGPL must be included, track it as a separate tag and implement filters to avoid regurgitation, especially of full files or distinctive functions. (Astraea Counsel) |
| Letting devs paste suggestions directly into products | Add IDE warnings when suggestions match known repos or when similarity crosses a threshold; encourage devs to treat AI suggestions like StackOverflow code: check license before shipping. |
| Releasing models or copilot products publicly | Maintain a training data inventory (see the sketch after this table) that at least distinguishes permissive, copyleft, proprietary, and “unknown” sources. Avoid marketing your model as “fully open source” unless it fits OSI’s Open Source AI Definition (code + weights + data information under OSI-approved licenses). (Open Source Initiative) |
| Running on customer codebases | Treat customer code as high-sensitivity, contract-governed data. Segment models trained on customer repos from those trained on internet-scale OSS, and be explicit in contracts about training rights vs. inference-only use. |
| Using open-weight models that were themselves trained on unknown or mixed code | Read the model license; many “open” models are actually “open-weight” with non-commercial or usage restrictions. Don’t assume they’re safe to embed in commercial dev tools just because the weights are downloadable. (Hunton Andrews Kurth) |
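For the “training data inventory” row above, even a flat, append-only record per source captures most of the audit value. Here is a minimal sketch; the `CorpusEntry` fields, the category labels and the JSON file format are assumptions, not a standard.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CorpusEntry:
    source_url: str            # where the code was collected from
    spdx_id: str               # detected license, or "NOASSERTION"
    category: str              # "permissive" | "copyleft" | "proprietary" | "unknown"
    collected_on: str          # ISO date, so the snapshot can be reproduced
    used_in_models: list[str]  # which training runs consumed this entry

def write_inventory(entries: list[CorpusEntry], path: str) -> None:
    """Persist the inventory so you can later show, per model, which
    license categories its training data contained."""
    with open(path, "w") as f:
        json.dump([asdict(e) for e in entries], f, indent=2)

# Hypothetical entries: one permissive source, one unknown-license source.
write_inventory(
    [
        CorpusEntry("https://github.com/example/lib", "MIT",
                    "permissive", "2025-01-15", ["assistant-v1"]),
        CorpusEntry("https://pastes.example/abc", "NOASSERTION",
                    "unknown", "2025-01-15", []),
    ],
    "training_inventory.json",
)
```

This is exactly the kind of record that makes the closing point below – explaining under oath which licenses you relied on – a paperwork exercise rather than a forensic one.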
📌 Takeaways
- No major FOSS license currently speaks the language of “AI training,” so we’re mapping 1990s text onto 2025 technology.
- For MIT / BSD / Apache, the main legal tension is between:
- Broad permission to use and modify code (including for training), and
- Attribution / notice duties when code (or recognisable chunks of it) are redistributed via model outputs.
- For GPL / AGPL / MPL, the realistic risk today is less “your model is now GPL” and more:
- Outputs carrying copyleft code into closed products, and
- The possibility that future courts or regulators push toward some form of license propagation in extreme memorisation scenarios.
- Courts are moving from “abstract fair use arguments” to granular, fact-intensive analyses of how models were trained and what they output – making documentation, provenance tracking, and output filtering central risk controls. (Astraea Counsel)
From an AI builder’s perspective, the safest posture right now is:
Treat open-source code as licensed, not “free fuel”; design your training and tooling so you could explain, under oath, which licenses you relied on, how you complied with them, and how you prevent your model from becoming a code-copying machine.