
fix: treat protocol-relative URLs as blocked domains in safe-outputs sanitizer#23930

Merged
pelikhan merged 5 commits into main from copilot/fix-protocol-relative-url-sanitization on Apr 2, 2026

Conversation

Contributor

Copilot AI commented Apr 1, 2026

sanitizeUrlDomains() was anchored to https:// and sanitizeUrlProtocols() required an explicit scheme, so //evil.com/path passed both filters unchanged. Browsers on HTTPS pages resolve //hostname/path to https://hostname/path, fully bypassing the domain allowlist, including in image embeds, where camo proxy requests leak to the attacker.

Changes

  • sanitize_content_core.cjs: sanitizeUrlDomains()

    • Extracted shared allowlist logic into inner helper applyDomainFilter(match, hostnameWithPort)
    • Added a second regex pass for protocol-relative URLs using a delimiter-based precondition (^|[\s([{"']) that matches // only at start-of-string or after clear URL delimiters (whitespace, brackets, quotes), avoiding false matches on // inside HTTPS URL paths (e.g. https://github.com//issues)
    • The path stop-condition (?!\/\/) is symmetric with the existing https:// stop-condition
  • sanitize_content.test.cjs

    • New "protocol-relative URL sanitization" describe block with 10 regression tests: plain text, markdown links ([text](//evil.com/path)), image embeds (![](//evil.com/pixel.gif)), allowed domains, subdomains, custom domain env var, mixed content, path+query, redaction logging, and a regression case ensuring https:// URLs with double-slash paths (e.g. https://github.com//issues) are not incorrectly redacted
// Before
sanitizeContentCore('[click](//evil.com/steal)')
// → '[click](//evil.com/steal)'   ← not redacted

// After
sanitizeContentCore('[click](//evil.com/steal)')
// → '[click]((evil.com/redacted)'
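The delimiter-based pass described in the Changes list can be sketched in isolation. applyDomainFilter below is a stand-in for the real helper in sanitize_content_core.cjs: it redacts every matched host, whereas the shipped helper consults the allowlist.

```javascript
// Delimiter-based protocol-relative URL pass (sketch; stand-in filter).
const protoRelativeUrlRegex =
  /(^|[\s([{"'])(\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?)/gi;

function redactProtocolRelative(s) {
  // Stand-in: redact every matched hostname (no allowlist lookup).
  const applyDomainFilter = (url, hostnameWithPort) => `(${hostnameWithPort}/redacted)`;
  return s.replace(
    protoRelativeUrlRegex,
    (match, prefix, url, hostnameWithPort) => prefix + applyDomainFilter(url, hostnameWithPort)
  );
}

console.log(redactProtocolRelative("see //evil.com/path for details"));
// → "see (evil.com/redacted) for details"
console.log(redactProtocolRelative("https://github.com//issues"));
// → "https://github.com//issues" (double-slash path left alone: no delimiter before //)
```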

…sanitizer

Protocol-relative URLs (//hostname/path) were not matched by either
sanitizeUrlProtocols() (which requires an explicit scheme://) or
sanitizeUrlDomains() (which was anchored to https://). Browsers on HTTPS
pages resolve //hostname/path to https://hostname/path, so these URLs must
be subject to the same domain allowlist check.

Extends sanitizeUrlDomains() with a second regex pass for //hostname/path
patterns. A negative lookbehind (?<![:/]) prevents the new pattern from
matching the // inside https://. The shared applyDomainFilter() helper is
used by both the existing https:// pass and the new protocol-relative pass,
ensuring identical allow/redact logic.

Fixes: //evil.com/steal, [click](//evil.com/steal), and
![img](//evil.com/pixel.gif) are now redacted to (evil.com/redacted).

Adds 9 regression tests covering plain text, markdown links, image embeds,
allowed/disallowed domains, custom allowed domains, mixed content, path +
query strings, and logging.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0cad179d-8061-4756-945e-9233c95fdae7

Co-authored-by: szabta89 <1330202+szabta89@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix safe-outputs sanitizer to block protocol-relative URLs" to "fix: treat protocol-relative URLs as blocked domains in safe-outputs sanitizer" on Apr 1, 2026
Copilot AI requested a review from szabta89 April 1, 2026 16:47
@pelikhan pelikhan marked this pull request as ready for review April 1, 2026 23:52
Copilot AI review requested due to automatic review settings April 1, 2026 23:52
Contributor

Copilot AI left a comment


Pull request overview

Fixes a sanitizer bypass where protocol-relative URLs (//host/path) were not subject to domain allowlisting, allowing disallowed domains to pass unchanged (including in markdown links/images).

Changes:

  • Refactors sanitizeUrlDomains() to share allowlist/redaction logic via an inner helper.
  • Adds a second domain-filtering pass to redact protocol-relative URLs (//hostname/...).
  • Adds regression tests covering protocol-relative URLs in plain text, markdown links, and image embeds.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
actions/setup/js/sanitize_content_core.cjs | Adds protocol-relative URL domain filtering using a second regex pass and shared allowlist logic.
actions/setup/js/sanitize_content.test.cjs | Adds regression tests ensuring protocol-relative URLs are redacted/allowed appropriately and logged.


Comment on lines +288 to +295
// The negative lookbehind (?<![:/]) avoids matching the // inside https://.
// The path stop-condition (?!\/\/) stops before the next protocol-relative URL
// (analogous to how the httpsUrlRegex stops before the next https:// URL).
// The path capture group is present for structural symmetry with httpsUrlRegex
// but is not needed by applyDomainFilter — the full match is used instead.
const protoRelativeUrlRegex = /(?<![:/])\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?/gi;

s = s.replace(protoRelativeUrlRegex, (match, hostnameWithPort) => applyDomainFilter(match, hostnameWithPort));

Copilot AI Apr 1, 2026


The protocol-relative regex will also match "//" sequences that occur inside the path of an allowed https:// URL (because the lookbehind only excludes ':' and '/'). For example, in "https://github.com//issues", the "//issues" part is preceded by "m" and will be treated as a protocol-relative URL, potentially redacting/altering an otherwise-allowed URL. Consider tightening the precondition so protocol-relative matches only occur at the start of the string or after clear URL delimiters (whitespace, '(', '[', quotes, etc.), rather than after hostname/path characters.

Suggested change
// The negative lookbehind (?<![:/]) avoids matching the // inside https://.
// The path stop-condition (?!\/\/) stops before the next protocol-relative URL
// (analogous to how the httpsUrlRegex stops before the next https:// URL).
// The path capture group is present for structural symmetry with httpsUrlRegex
// but is not needed by applyDomainFilter — the full match is used instead.
const protoRelativeUrlRegex = /(?<![:/])\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?/gi;
s = s.replace(protoRelativeUrlRegex, (match, hostnameWithPort) => applyDomainFilter(match, hostnameWithPort));
// We only treat // as a protocol-relative URL when it appears at the start of
// the string or immediately after a clear delimiter (whitespace, brackets,
// or quotes). This avoids matching // segments inside the path of an allowed
// https:// URL, such as "https://github.com//issues".
// The path stop-condition (?!\/\/) stops before the next protocol-relative URL
// (analogous to how the httpsUrlRegex stops before the next https:// URL).
// Capture groups:
// 1: prefix (start-of-string or delimiter)
// 2: full protocol-relative URL (starting with //)
// 3: hostname (and optional port)
// 4: optional path
const protoRelativeUrlRegex = /(^|[\s([{"'])(\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?)/gi;
s = s.replace(
protoRelativeUrlRegex,
(match, prefix, url, hostnameWithPort) => prefix + applyDomainFilter(url, hostnameWithPort)
);
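The false match this comment describes can be reproduced standalone with the pre-fix lookbehind regex quoted above (a sketch for illustration, not the shipped code):

```javascript
// The lookbehind (?<![:/]) only inspects the single preceding character,
// so the // inside an https URL *path* (preceded by "m" in "github.com//")
// still matches and would be treated as a protocol-relative URL.
const lookbehindRegex = /(?<![:/])\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?/gi;

const input = "https://github.com//issues and //evil.com/path";
const hits = [...input.matchAll(lookbehindRegex)].map(m => m[0]);
console.log(hits);
// → ["//issues", "//evil.com/path"]  ← "//issues" is the false positive
```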

expect(result).toContain("(evil.com/redacted)");
expect(result).not.toContain("//evil.com");
});


Copilot AI Apr 1, 2026


Consider adding a regression test for an allowed https:// URL that contains a double-slash in the path (e.g., "https://github.com//issues") to ensure the new protocol-relative pass doesn’t accidentally treat path "//..." segments as protocol-relative URLs and redact them.

Suggested change
it("should not treat double slashes in https URL paths as protocol-relative URLs", () => {
const result = sanitizeContent("https://github.com//issues and //evil.com/path");
expect(result).toContain("https://github.com//issues");
expect(result).toContain("(evil.com/redacted)");
expect(result).not.toContain("//evil.com");
});

@pelikhan
Collaborator

pelikhan commented Apr 2, 2026

@copilot review comments


… lookbehind

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/faa1705f-42c1-4f84-9165-6067c54c13e4

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 2, 2026

@copilot review comments

Addressed both review comments in ae6a1b8:

  1. Regex tightened (sanitize_content_core.cjs): Replaced the lookbehind (?<![:/]) approach with a delimiter-based match (^|[\s([{"']). The regex now only treats // as a protocol-relative URL at start-of-string or after whitespace/brackets/quotes, preventing false matches on // inside HTTPS URL paths like https://github.com//issues.

  2. Regression test added (sanitize_content.test.cjs): Added a test confirming that https://github.com//issues is preserved unchanged while //evil.com/path in the same string is still redacted.

@pelikhan
Collaborator

pelikhan commented Apr 2, 2026

@copilot add more positive and negative test cases

Contributor Author

Copilot AI commented Apr 2, 2026

@copilot add more positive and negative test cases

Added 8 more tests in d03fdc8:

Negative (redacted):

  • //evil.com:8080/api — port number
  • //evil.com/a and //bad.org/b — multiple URLs in one string, all redacted
  • src="//evil.com/img.png" — HTML attribute with double-quote delimiter
  • //evil.com — hostname only (no path)

Positive (allowed):

  • //raw.githubusercontent.com/... with *.githubusercontent.com wildcard domain
  • //trusted.net/path with custom GH_AW_ALLOWED_DOMAINS=trusted.net
  • {//github.com/repo} — curly-brace delimited context

Boundary (not treated as protocol-relative):

  • word//evil.com/path: the // is not preceded by a delimiter, so it passes through unchanged
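These positive, negative, and boundary cases can be exercised with a minimal sketch. The allowlist here is a tiny stand-in; the real sanitizer reads GH_AW_ALLOWED_DOMAINS and supports wildcard domains.

```javascript
const protoRelativeUrlRegex =
  /(^|[\s([{"'])(\/\/([\w.-]+(?::\d+)?)(\/(?:(?!\/\/)[^\s,])*)?)/gi;
const allowed = new Set(["github.com"]); // stand-in allowlist

function sanitizeSketch(s) {
  return s.replace(protoRelativeUrlRegex, (match, prefix, url, host) =>
    // Strip an optional :port before the allowlist lookup.
    allowed.has(host.replace(/:\d+$/, "")) ? match : `${prefix}(${host}/redacted)`
  );
}

console.log(sanitizeSketch("//evil.com:8080/api"));  // → "(evil.com:8080/redacted)"
console.log(sanitizeSketch("word//evil.com/path"));  // → unchanged: no delimiter before //
console.log(sanitizeSketch("{//github.com/repo}"));  // → unchanged: host is allowlisted
```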

@pelikhan pelikhan merged commit 0725126 into main Apr 2, 2026
56 checks passed
@pelikhan pelikhan deleted the copilot/fix-protocol-relative-url-sanitization branch April 2, 2026 04:38


Development

Successfully merging this pull request may close these issues.

safe-outputs sanitizer must treat //hostname protocol-relative URLs as blocked domains

4 participants