What We Learned from Black Duck (And How We Made License Scanning Better)
Black Duck Software -- now part of Synopsys -- invented the commercial open source license scanning category in 2003. For two decades, their approach defined how enterprises managed open source compliance. We studied their methods carefully before building ScanRook. Here is what we learned, what has changed since they started, and why we think the economics of license scanning have fundamentally shifted.
How Black Duck Built the Category
When Black Duck launched in 2003, open source was still viewed with suspicion by most enterprises. SCO was suing IBM for $1 billion over alleged Linux copyright violations. Companies were terrified that their codebases might contain GPL-licensed code that could force them to release proprietary source code. Black Duck saw the opportunity and built a product to address it.
Their approach was technically impressive for its era: crawl every public source code repository on the internet, compute cryptographic hashes for code snippets at multiple granularities (function level, block level, file level), and build a massive proprietary fingerprint database called the KnowledgeBase. When a customer wanted to scan their codebase, Black Duck would hash their source code at the same granularities and match against the KnowledgeBase. Any matches revealed open source code -- and its license -- that had been copied into the proprietary codebase.
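To make the idea of multi-granularity fingerprinting concrete, here is a minimal Python sketch. It is not Black Duck's actual algorithm, which is proprietary. The whitespace normalization, the three-line window, and the function names are all assumptions chosen for illustration.

```python
import hashlib


def normalize_line(line: str) -> str:
    """Collapse whitespace so that re-indentation does not change the hash."""
    return " ".join(line.split())


def fingerprints(source: str, window: int = 3) -> dict:
    """Return a file-level hash plus a hash for every `window`-line block."""
    lines = [normalize_line(l) for l in source.splitlines() if l.strip()]
    file_hash = hashlib.sha256("\n".join(lines).encode()).hexdigest()
    blocks = set()
    # Slide a fixed-size window over the non-blank lines (at least one block).
    for i in range(max(len(lines) - window + 1, 1)):
        block = "\n".join(lines[i:i + window])
        blocks.add(hashlib.sha256(block.encode()).hexdigest())
    return {"file": file_hash, "blocks": blocks}


def shared_blocks(a: str, b: str, window: int = 3) -> int:
    """Count block fingerprints that two sources have in common."""
    return len(fingerprints(a, window)["blocks"] & fingerprints(b, window)["blocks"])
```

A function pasted into a larger file changes the file-level hash but still shares block-level hashes with the original, which is why matching at several granularities matters.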
This snippet-matching approach solved a real problem: developers copy-paste code from Stack Overflow, vendor packages without attribution, and fork libraries without preserving license files. Package manifest scanning alone cannot detect this. You need to look at the actual code. Black Duck's KnowledgeBase grew to cover billions of code snippets across millions of repositories, and their scanning accuracy was genuinely best-in-class.
The business model followed directly from that: the KnowledgeBase was proprietary and expensive to build, so Black Duck charged accordingly. Enterprise licenses typically started at $100,000 per year and scaled into the millions for large organizations. M&A due diligence scans -- a common use case where an acquiring company audits the target's open source usage -- could cost $50,000 to $150,000 per engagement.
What Has Changed Since 2003
Black Duck's moat was data: they had the largest database of open source code fingerprints, and building a competing database required years of crawling and indexing. But the world has changed dramatically since 2003, and the barriers to building license scanning capabilities have dropped by orders of magnitude.
Package registries provide license metadata for free
In the early 2000s, many open source packages were distributed as tarballs on personal websites. Today, npm, PyPI, crates.io, Maven Central, RubyGems, and most other major package registries require (or strongly encourage) license declarations in structured metadata. The npm registry alone contains license data for over 3 million packages. You do not need a proprietary database to know that React is MIT-licensed -- it says so in package.json.
Software Heritage has indexed the world's source code
The Software Heritage initiative, launched by Inria in 2016 and supported by UNESCO, has archived over 18 billion unique source files from 300 million repositories across GitHub, GitLab, Bitbucket, and other forges. Their archive is freely accessible via API. What was once Black Duck's most valuable proprietary asset -- a comprehensive index of the world's open source code -- now exists as a public good.
ScanCode provides 1,800+ license patterns under Apache-2.0
The ScanCode Toolkit by nexB contains over 1,800 license detection rules covering license texts, notices, tags, and URLs. It is Apache-2.0 licensed. The ClearlyDefined project, backed by the Open Source Initiative, provides curated license data for millions of packages via a free API. Together, these tools cover the vast majority of license detection scenarios that Black Duck's KnowledgeBase was built to handle.
MinHash makes fuzzy matching practical without proprietary data
Black Duck's snippet matching relied on exact and near-exact hash comparisons. Modern fuzzy matching techniques like MinHash and SimHash can detect code similarity even when the copied code has been reformatted, had its variables renamed, or been partially modified. These algorithms are well-documented in academic literature and can be implemented against any code corpus -- you do not need a proprietary database to use them.
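As an illustration of the technique, the following Python sketch estimates the Jaccard similarity of two code fragments with MinHash over token shingles. It is a teaching example rather than ScanRook's implementation; the tokenizer regex, the shingle size, and the choice of 128 salted BLAKE2b hashes are assumptions. Tolerating renamed variables would also require normalizing identifiers before shingling, which is not shown here.

```python
import hashlib
import re


def shingles(code: str, k: int = 4) -> set:
    """Tokenize code and return the set of k-token shingles."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Because the shingles are built from tokens rather than raw text, two copies of the same code that differ only in whitespace produce identical signatures.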
How ScanRook Approaches License Scanning
We designed ScanRook's license scanning to take advantage of everything that has changed since Black Duck's founding. Our approach is layered, starting with the most reliable and cheapest-to-compute methods and falling back to more sophisticated techniques only when necessary.
Layer 1: Package metadata extraction (free, instant)
For containerized applications -- which represent the majority of modern deployments -- ScanRook reads license data directly from package manager databases. RPM headers, APK metadata, dpkg copyright files, npm package.json, pip METADATA files, and Cargo.toml all carry license declarations, most of them in structured fields. This covers 90%+ of packages with zero external API calls and sub-second processing time per package. See our license scanning documentation for the full list of supported ecosystems.
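A rough Python sketch of what metadata extraction looks like for two of these formats follows. The function names are hypothetical and this is not ScanRook's code; it only shows that the declared license can be read with standard-library parsers, assuming the file contents have already been pulled out of the image.

```python
import json
from email.parser import Parser


def license_from_package_json(text: str):
    """Return the declared license from npm package.json content, if any."""
    data = json.loads(text)
    declared = data.get("license")
    if isinstance(declared, dict):  # legacy form: {"type": "MIT", "url": "..."}
        declared = declared.get("type")
    return declared or None


def license_from_python_metadata(text: str):
    """Return the license from a Python METADATA file (RFC 822-style headers)."""
    headers = Parser().parsestr(text)
    # Newer metadata versions use License-Expression; older ones use License.
    declared = headers.get("License-Expression") or headers.get("License")
    if declared:
        return declared.strip()
    # Fall back to the first trove classifier that names a license.
    for classifier in headers.get_all("Classifier", []):
        if classifier.startswith("License ::"):
            return classifier
    return None
```

The raw strings returned here still vary by ecosystem, so they need to be normalized before any policy is applied.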
Layer 2: ClearlyDefined API for gaps (free, fast)
When package metadata is missing or ambiguous, ScanRook queries the ClearlyDefined API, which provides curated license data for millions of packages. ClearlyDefined data is community-reviewed and freely available. This fills gaps without requiring any proprietary database.
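The lookup can be sketched in Python as below. The URL layout (type/provider/namespace/name/revision, with "-" for an empty namespace) and the `licensed.declared` response field reflect our reading of ClearlyDefined's public API and should be checked against its documentation. The function names and the timeout are assumptions, and this is not ScanRook's client code.

```python
import json
import urllib.request

API_ROOT = "https://api.clearlydefined.io/definitions"


def definition_url(pkg_type, provider, name, revision, namespace="-"):
    """Build the definition URL for a package coordinate."""
    return f"{API_ROOT}/{pkg_type}/{provider}/{namespace}/{name}/{revision}"


def declared_license(definition: dict):
    """Extract the declared license from a definition, or None if absent."""
    declared = definition.get("licensed", {}).get("declared")
    if declared in (None, "", "NOASSERTION"):
        return None
    return declared


def fetch_declared_license(pkg_type, provider, name, revision,
                           namespace="-", timeout=10):
    """Query the API over the network and return the declared license."""
    url = definition_url(pkg_type, provider, name, revision, namespace)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return declared_license(json.load(response))
```

For example, `fetch_declared_license("npm", "npmjs", "react", "18.2.0")` would request the definition for that React release. Keeping URL construction and response parsing in separate functions lets both be tested without network access.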
Layer 3: License text pattern matching (local, fast)
For source code tarballs and repositories where structured metadata is not available, ScanRook scans LICENSE, COPYING, NOTICE, and README files using pattern matching against known license texts. This detects licenses even when packages do not include metadata files, covering cases like vendored dependencies that still ship their license files. It does not detect copy-pasted code that arrives without any license text.
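A deliberately tiny Python sketch of this kind of matching is shown below. The three rules are hypothetical stand-ins; a real rule set such as ScanCode's contains well over a thousand entries and handles partial texts, notices, and tags. The idea is simply to normalize case, punctuation, and whitespace, then look for phrases that are distinctive to each license text.

```python
import re

# Hypothetical rule set: (SPDX identifier, distinctive phrase from the text).
RULES = [
    ("MIT", "Permission is hereby granted, free of charge, to any person obtaining a copy"),
    ("Apache-2.0", "Apache License Version 2.0"),
    ("MPL-2.0", "Mozilla Public License Version 2.0"),
]


def normalize_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def detect_licenses(text: str) -> list:
    """Return SPDX identifiers whose distinctive phrase occurs in the text."""
    haystack = normalize_text(text)
    return [spdx for spdx, phrase in RULES if normalize_text(phrase) in haystack]
```

Normalizing both the rule phrase and the scanned file means line wrapping and comma placement in a LICENSE file do not prevent a match.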
Normalization to SPDX
Every detected license is normalized to its SPDX identifier. This means that Fedora's "ASL 2.0", npm's "Apache-2.0", and PyPI's "License :: OSI Approved :: Apache Software License" all resolve to the same Apache-2.0 identifier. Consistent normalization is essential for policy evaluation across heterogeneous dependency trees.
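In its simplest form, normalization is a lookup from ecosystem-specific spellings to SPDX identifiers. The Python sketch below uses a small hypothetical alias table built from the examples above. A production mapping would be far larger and would also need to parse compound expressions such as "MIT OR Apache-2.0", which this sketch does not attempt.

```python
# Hypothetical alias table; keys are lowercased with whitespace collapsed.
ALIASES = {
    "asl 2.0": "Apache-2.0",
    "apache-2.0": "Apache-2.0",
    "apache software license": "Apache-2.0",
    "license :: osi approved :: apache software license": "Apache-2.0",
    "mit": "MIT",
    "license :: osi approved :: mit license": "MIT",
}


def to_spdx(raw: str):
    """Map a raw license string to an SPDX identifier, or None if unknown."""
    key = " ".join(raw.strip().lower().split())
    return ALIASES.get(key)
```

Returning None for unknown strings, rather than guessing, leaves room for a later layer or a human reviewer to resolve them.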
Why Integrated Beats Standalone
Black Duck began as a license-only tool, and Synopsys added vulnerability scanning after the acquisition. FOSSA is primarily a license compliance tool that bolted on basic vulnerability detection. Both treat license scanning and vulnerability scanning as separate problems with separate tools, separate dashboards, and separate pricing.
We think this separation is artificial. When you scan a container image, you want to know three things: (1) what is inside it (SBOM), (2) is any of it vulnerable (CVE scanning), and (3) are any of the licenses a problem (license compliance). These are three views of the same underlying data -- the list of packages and their metadata. Running three separate tools against the same artifact means parsing the same package databases three times, maintaining three sets of credentials, reconciling three different package name formats, and paying three vendors.
ScanRook produces all three outputs from a single scan. Vulnerabilities, licenses, and SBOM data come from the same package parsing pass, using the same package identifiers, in the same report. License policies and vulnerability severity thresholds can be evaluated together. For example, a low-severity CVE in a GPL-3.0 package might warrant more attention than a medium-severity CVE in an MIT package, because the GPL package carries both security and legal risk.
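One way to picture joint evaluation is a single score that adds a license-risk weight to a severity weight. The Python sketch below is purely illustrative: the weights, the `Finding` type, and the scoring rule are assumptions chosen so that the example above holds, not ScanRook's actual policy engine.

```python
from dataclasses import dataclass

# Hypothetical weights; a real policy would be configured per organization.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}
LICENSE_WEIGHT = {"GPL-3.0-only": 2, "GPL-3.0-or-later": 2, "AGPL-3.0-only": 3}


@dataclass
class Finding:
    package: str
    license: str   # SPDX identifier
    severity: str  # "low", "medium", "high", or "critical"


def attention_score(finding: Finding) -> int:
    """Combine security and legal risk; unlisted licenses add no weight."""
    return SEVERITY_WEIGHT[finding.severity] + LICENSE_WEIGHT.get(finding.license, 0)


def prioritize(findings: list) -> list:
    """Sort findings so the highest combined risk comes first."""
    return sorted(findings, key=attention_score, reverse=True)
```

With these weights, a low-severity finding in a GPL-3.0-only package scores 3 and sorts ahead of a medium-severity finding in an MIT package, which scores 2.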
The Cost Comparison
Black Duck (Synopsys SCA) pricing is not publicly listed, but industry reports and customer accounts consistently put it in the $100,000-$500,000/year range for enterprise licenses. FOSSA pricing starts around $20,000/year for small teams and scales into six figures for enterprise. Both require multi-year contracts and professional services engagements for onboarding.
ScanRook provides license scanning, vulnerability scanning, and SBOM generation at a fraction of that cost. The free tier includes license detection for every scan. The Pro tier adds policy enforcement and compliance reporting. For organizations that need self-hosted deployment in air-gapped environments -- a common requirement in defense and financial services -- ScanRook self-hosted provides the same capabilities without sending any data to external services.
This is not a criticism of Black Duck -- they built something genuinely pioneering, and their KnowledgeBase was worth every penny when the alternatives were manual code audits. But the world has changed. The data that made their approach expensive is now freely available, the algorithms are well-understood, and the package ecosystem has standardized on structured license metadata. The economics of license scanning no longer justify six-figure annual fees.
When You Still Need Black Duck
We believe in honest comparisons. There are scenarios where Black Duck (Synopsys SCA) remains the better choice:
- M&A due diligence audits where the acquiring company's legal team requires a report from a recognized brand name. Synopsys's reputation carries weight in boardrooms in a way that newer tools do not yet match.
- Snippet-level detection of copied code that was vendored without any package manager. If developers copy-pasted GPL-licensed code into a proprietary codebase without any package.json or Cargo.toml, metadata-based scanning will miss it. Black Duck's snippet matching is still best-in-class for this use case.
- Existing enterprise contracts where switching costs exceed the price difference. If your organization has already invested in Black Duck integration, custom policies, and team training, the migration cost may not be justified.
For everything else -- container scanning, CI/CD pipeline integration, ongoing compliance monitoring, and new projects starting from scratch -- we think the integrated, metadata-first approach delivers better results at a fundamentally lower cost.
Try It
ScanRook's license scanning is available on every plan, including the free tier. Upload a container image or source archive, and you will see the complete license inventory alongside vulnerabilities and SBOM data -- all from a single scan.