CrowdStrike Parsing Standard (CPS)
The standard for our data format as parsed in Next-Gen SIEM.
The standard is based on Elastic Common Schema (ECS), with all deviations and clarifications noted below.
Changelog
- 1.0.0
- The Parsing Standard was previously embedded in the old Package Standards document. That document still exists to document our approach to packages as a whole, but the parsing standard has been extracted so it can be referenced outside of packages. Going forward, the PaSta acronym refers to the parsing standard only.
- Compared to the previous standard from the Package Standards document, the Parsing Standard is changed in the following ways:
- Adds new fields to tag
- Removes the `Product` field, replaced by guidelines for `event.module` and `event.dataset`
- Removes the `event.code` field (to be reinstated later)
- Removes the `related` fields
- Normalises values for a range of new fields
Version 1.0.0
We use the latest 8.x version of ECS (which is the current major version at the time of writing). We are free to upgrade to minor and patch revisions without updating this standard, but going to a new major version requires a new revision of the standard.
Rules
- The following fields shall be tagged:
  - `Cps.version`
  - `Vendor`
  - `ecs.version`
  - `event.dataset`
  - `event.kind`
  - `event.module`
  - `event.outcome`
  - `observer.type`
- The following fields shall always be populated for all events, unless otherwise noted:
  - Event categorization fields (kind, type, category, outcome)
    - `event.outcome` shall only be assigned when an event can logically contain an outcome.
    - `event.type` and `event.category` shall be assigned as LogScale arrays, and they are permitted to be empty.
`ecs.version`
- This field shall contain the version of ECS that is being followed by the parser.

`Cps.version`
- This field shall contain a MAJOR.MINOR.PATCH version number à la Semantic Versioning.
- The version denotes the version of this standard which the parser targeted during ingest.
`Parser.version`
- This field shall contain a MAJOR.MINOR.PATCH version number à la Semantic Versioning.
- This version number is specific to the parser which parsed the event, and is not related to e.g. the version of the package the parser may have been installed from.
- The rules for updating the version number are:
- Any change to an existing field (large or small) is a breaking change, and requires a new major version.
- If new fields are added, then a new minor version is usually sufficient.
- Patch versions are for parser changes that do not affect which fields are output by it (performance optimization, etc.)
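The bump rules above can be sketched as a small helper. This is purely illustrative: the function name and the change-type labels are assumptions for the example, not part of the standard.

```python
def next_parser_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH parser version for a change.

    change is one of:
      "field_changed" - any change to an existing field (breaking)
      "field_added"   - new fields were added
      "internal"      - no change to output fields (e.g. performance)
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "field_changed":
        return f"{major + 1}.0.0"   # any field change is breaking
    if change == "field_added":
        return f"{major}.{minor + 1}.0"
    if change == "internal":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(next_parser_version("1.4.2", "field_changed"))  # 2.0.0
print(next_parser_version("1.4.2", "field_added"))    # 1.5.0
print(next_parser_version("1.4.2", "internal"))       # 1.4.3
```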
`Vendor`
- If the event was parsed with a parser from a package, the vendor name used here must match the vendor name used in the package scope (e.g. “fortinet” for “fortinet/fortigate”).
- If a parser sets any of the following fields, those must be consistent with the vendor names used in other PaSta-compliant parsers:
  - `observer.vendor`
  - `vulnerability.scanner.vendor`
  - `device.manufacturer`
- See the Vendor guidelines for guidance on which vendor name to use.
`event.module`
- Shall contain, roughly, the name of the product or service that the event belongs to.
- Existing `event.module` values shall be reused whenever appropriate. See the `event.module` guidelines for further guidance.
`event.dataset`
- Shall either:
  - Contain the specific name of the dataset within the module described by `event.module`, prefixed by the value of `event.module` with a dot in between.
  - Not exist, if it doesn’t contain any information beyond what is present in `event.module`.
- Example combinations of the above fields can be:
Vendor |
event.module |
event.dataset |
microsoft |
azure |
azure.entraid |
zscaler |
zia |
zia.web |
- For any given data source, the author of the parser shall determine, on a best-effort basis, which domain specific fields are applicable to the data.
- The only fields in ECS from which parsers shall deviate are:
  - Fields which we use as tags have their names prefixed with `#`.
  - The field `event.original` shall not be present, since we use `@rawstring` instead.
  - The field `event.ingested` shall not be present, since we use `@ingesttimestamp` instead.
  - The field `@timestamp` shall contain a Unix timestamp, rather than a human-readable timestamp.
  - The field `event.code` shall not be present for the moment, since we plan to tag it in the future (thus introducing a breaking change).
    - The value from `event.code` can still be made available in a vendor-specific field, e.g. `Vendor.event_type`.
  - The `related` fields shall not be present.
  - The following fields shall all have their values lowercased using the en-US locale:
    - `*.address`
    - `*.domain`
    - `email.*.address`
    - `host.hostname`
    - `*.hash.*`
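As a rough illustration of the lowercasing rule, the patterns above can be matched with glob-style wildcards. The use of `fnmatch` here is an assumption for the sketch; LogScale parsers express this normalization differently.

```python
import fnmatch

# Patterns from the standard whose field values must be lowercased.
LOWERCASE_PATTERNS = [
    "*.address", "*.domain", "email.*.address", "host.hostname", "*.hash.*",
]

def normalize_case(event: dict) -> dict:
    """Lowercase the values of fields matching any of the patterns above."""
    out = {}
    for name, value in event.items():
        if any(fnmatch.fnmatch(name, pattern) for pattern in LOWERCASE_PATTERNS):
            out[name] = value.lower()
        else:
            out[name] = value
    return out

print(normalize_case({"source.domain": "Example.COM", "user.name": "Alice"}))
# {'source.domain': 'example.com', 'user.name': 'Alice'}
```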
- Parsers shall strive to make all fields in a log event available as actual LogScale fields, even if they don’t match a field in ECS.
  - Fields from the event which do not exist in ECS shall have their names prefixed with the string literal “Vendor.”
    - This gives the ECS fields the “root” namespace, while vendor-specific fields can always be found under the “Vendor.” prefix.
  - If a field can exist as both an ECS field and a vendor-specific field, the following logic applies:
    - If the value of both fields is byte-for-byte the same, it is allowed to only preserve the ECS field and discard the vendor-specific field.
    - If the values of the fields differ, both fields shall be preserved.
      - For example, an ECS field may require its value to be lowercased, but the original log has mixed casing. In that case, the vendor-specific field shall contain the original, mixed-case value.
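The keep-one-or-both logic above can be sketched as follows. The field names in the example are hypothetical, chosen only to show the mixed-casing case.

```python
def merge_fields(ecs_name: str, ecs_value: str,
                 vendor_name: str, vendor_value: str) -> dict:
    """Keep only the ECS field when both values are byte-for-byte identical;
    otherwise preserve both the ECS and the vendor-specific field."""
    if ecs_value == vendor_value:
        return {ecs_name: ecs_value}  # discard the redundant vendor copy
    return {ecs_name: ecs_value, vendor_name: vendor_value}

# Values differ only in casing, so both fields are preserved:
print(merge_fields("source.domain", "example.com",
                   "Vendor.domain", "Example.COM"))
# {'source.domain': 'example.com', 'Vendor.domain': 'Example.COM'}
```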
- When adding new fields to this standard, all fields which are not taken directly from ECS must have a capital letter from the point where they differ from the schema.
  - Using capital letters for field names follows the guidance from ECS on how to add event fields outside the schema.
  - Example of a fully custom field: `Parser.version` is similar to `ecs.version`, but the `Parser` namespace is our own, so it must start with a capital letter.
  - Example of extending ECS with a custom field: `observer.Fictional_field`, where `observer` is an existing namespace in the schema, but `Fictional_field` is our own field inside that namespace.
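A simplified check for this naming rule might look like the sketch below. Note the ECS namespace list is a tiny illustrative subset, not the real schema, and the check only inspects the first and last path segments.

```python
# Illustrative subset of ECS top-level namespaces (not the full schema).
KNOWN_ECS_NAMESPACES = {"event", "observer", "ecs", "host", "source"}

def valid_custom_field(name: str) -> bool:
    """Check that a custom field is capitalized from where it leaves ECS."""
    first = name.split(".")[0]
    if first in KNOWN_ECS_NAMESPACES:
        # Extending an ECS namespace: the custom leaf must be capitalized.
        return name.split(".")[-1][0].isupper()
    # Fully custom namespace: the namespace itself must be capitalized.
    return first[0].isupper()

print(valid_custom_field("Parser.version"))           # True
print(valid_custom_field("observer.Fictional_field")) # True
print(valid_custom_field("observer.fictional_field")) # False
```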
Appendix
Appendix A - Reasoning behind packaging standard v0.1.0
These are the reasons we settled on using ECS as our schema, and the deviations we make from it. The decision was made May 31st 2023.
There were several options in play initially:
- We could decide on not having a data model now, but just normalizing field names, so we could apply a data model later via field aliasing.
- This was not deemed good enough to move forward. The main concern is that we will break compatibility with data whenever we make a move, so instead of making half a move now and half a move later when we decide on a data model, we should just make the full leap.
- XDR is not considered broad enough of a model to cover all the general purpose logging that LogScale covers
- We also do not want to create yet another custom data model, as that’s quite a slog to get started on and do right
- Elastic Common Schema (ECS) is the only candidate which seems viable
- It covers a broad range of topics, making it applicable across different log sources
- It can be implemented in a piecemeal fashion
- It has been formally donated to OpenTelemetry (OTel), so we can hopefully refer to it via them, instead of via Elastic (a competitor)
- Where we have decided to deviate from ECS:
- We want to have the original fields of the log available next to the ECS fields. In order to avoid name conflicts between those fields and the ECS fields, all the original fields will be available with the “Vendor.” prefix. Additionally, users can filter on fields which allow them to find only the logs for the product and vendor they care about. With those logs in hand, they can search using the “Vendor.” fields for that product.
- We need to decide up front which fields across all ECS event types need to be tagged. That’s because these tags must be consistent across data sources for the search experience to be coherent.
Appendix B - Deviating from ECS
There’s been a lot of debate about whether we should deviate from ECS or not, with the arguments and decisions captured here.
Those who wish to comply 100% with ECS want us to be able to have simple messaging around being ECS compliant, and not have to attach any asterisks to the statement that we are compliant.
Further, there is a concern that if we “edit” the standard in any way, that we will continue to do so in the future, provoking unnecessary churn for all users.
On the other hand, those who wish to deviate from ECS are wishing to do so in order to have it play well with LogScale in the long run, and make sure we don’t invite unnecessary costs and user experience pains.
As such, the overall principles we are using for deviating from ECS today are:
- Tagging fields changes their names, but we must be able to tag for performance reasons. However, we are keeping tagging to categorisation-like fields, and not other fields, to make the tagging more predictable for users.
- Field values can be subject to different rules than in ECS, like enforcing casing of text in places where ECS does not enforce it.
- Any fields we add outside of ECS must comply with the ECS naming convention for custom fields.
- If we ban certain fields, we do so because they would inflict more user pain than gain in the long run.
Overall, we are committed to our ECS mappings being predictable, and we intend to live with the standard as it is in most places.
Appendix C - Reasoning behind moving to parsing standard v1.0.0
We have built and released a number of parsers that normalize incoming events to ECS, and have gathered some experience with them. We also want a stable foundation for NG-SIEM to build on, so we are making this update now, before NG-SIEM takes off.
The main concerns that we want the standard to support going forward are (in no particular order):
- Supporting faster search speeds as people search across more data
- Keeping performance high and COGS low as best we can going forward
- Enabling users to search across fields in a more consistent manner
This primarily means:
- applying more tags than before
- removing the `related` fields
- normalising additional fields
- clarifying ambiguities in the first standard
These changes have led to some in-depth discussions, which have their outcomes captured below:
Tagging arrays
We want to tag all the four ECS categorisation fields, but two of those fields are arrays, which don’t work well with the tagging mechanism we have today.
In the end, we settled on not tagging these arrays, as the approaches we could come up with within the current system all had some big flaws.
As such, we are hoping that arrays will become properly taggable at some point in the future.
The two approaches that were discussed were:
- Changing array entries to dedicated fields of their own
  - So `event.category[0] := "network"` would become something like `event.category.network := "true"`
- Merging the array fields into a single field
  - So if an event has `event.category[0] := "network"` and `event.category[1] := "api"`, it would instead have `event.category := "network|api"`
Approach #1 suffers from an unintuitive searching experience, where these tags are really just “presence” tags.
That is, we don’t want to assign them any values, since those are meaningless. We only want to tag whether they are present or not.
But creating all the values as individual tags also makes our safety net of tag grouping ineffective.
That is, tag grouping will be enabled when a single tag creates too many data sources, but if all these values have their own tags, tag grouping cannot apply to them collectively, and we must instead rely on the “next” safety net, where no new data sources get created, which is also quite bad.
Approach #2 suffers from a very fragile user experience.
It requires users to search the field differently than other fields, but it also only works as a tag if all the values in the field always have the same sorting.
It’s also very easy for users to search in a wrong way.
For example, they may think that if they are looking for events which belong to the `api` and `network` categories, they can search with something like `event.category = "api|network"`, but that won’t find events which are categorized as `event.category = "api|authentication|network"`.
And if a new entry is ever added to the list of categories which happens to be a substring of another category (e.g. `file` and `profile`), it becomes even easier to write queries which are wrong.
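Both failure modes of the merged-field encoding are easy to demonstrate with plain string operations on the example values above:

```python
merged = "api|authentication|network"

# A naive equality search misses this event entirely:
print(merged == "api|network")       # False

# Substring matching misfires once one category name contains another:
print("file" in "profile|network")   # True, yet the event has no "file" category

# A correct membership test has to split the field first:
print("api" in merged.split("|"))    # True
```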
Removing the related fields
ECS defines the following fields, which we will remove completely:
- `related.ip`
- `related.user`
- `related.hosts`
- `related.hash`
These fields would serve two overall purposes in LogScale:
- On a given event, show a nice list of all IP addresses, user names, etc. contained within.
- Allow pivoting from e.g. an IP address to seeing all events which relate to that address, by running a search like `array:contains("related.ip[]", value="1.1.1.1")`
However, this comes with a number of tradeoffs in LogScale:
- Data is duplicated across fields, increasing storage costs and slowing search performance
- Parsers become more complex to write and maintain
- Anyone using field aliasing to map data to ECS will be unable to use these arrays properly, as mapping fields to arrays today is very brittle
- In cases where an event contains a lot of data, the duplication increases the likelihood that the event will hit the 1000 field limit and become truncated
- The system is limited to only four “types” today, and if we want to extend that, we have to either add our own new fields here (which risks conflicting with future ECS versions), or we add them in a custom namespace to avoid conflicts. Additionally, any time we want to add an extra type, we also increase the size of events.
We are removing the fields to avoid these tradeoffs. Today we do have some options for supporting the use cases of those fields in a different way, like using raw text search and saved queries, but these also have unsatisfactory tradeoffs. For example, raw text search can result in poorer search performance and false positive results in some cases, while saved queries require maintenance but won’t have clear ownership for now at least.
However, we believe we can build functionality which supports the previous uses without these tradeoffs. That will take some time though, and we will need to rely on saved queries and raw text search in the interim.