Introduction to Data Quality

Define “Dirty” Data

Dirty data is defined as inaccurate, incomplete or erroneous data, especially in a computer system or database.

In Salesforce, dirty data can show up in a number of ways, including:

  • Duplicate records (e.g. two leads with the same information, a contact and a lead with the same information, etc.)
  • Incomplete records (e.g. a lead without an email address or phone number)
  • Inaccurate records (e.g. an opportunity with an inaccurate close date)

Define “Data Quality”

Data quality refers to the usability and accuracy of data.  Dirty data is of poor quality.  Data that is de-duplicated, properly formatted, fully populated, and accurate is considered clean and of high quality.

Maintaining Good Data Quality

The standard tools used to create records (import wizards, Data Loader, web-to-lead) are not designed to thoroughly manage data quality:

  • Import wizards have limited criteria to match duplicate records (e.g. name, email address) on import.
  • Duplicate matching occurs on only one object (e.g. importing leads will not match against existing contact records).
  • Web-to-lead does not perform any duplicate matching.
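
To make the single-object limitation concrete, here is a minimal Python sketch of a pre-import check that compares incoming leads against both existing leads and existing contacts by email address. The record layout (dicts with `name` and `email` keys) and the function names are hypothetical and used only for illustration; this is not part of any Salesforce tool or API.

```python
def normalize_email(email):
    """Lowercase and trim an email so trivial differences do not hide a match."""
    return (email or "").strip().lower()


def find_cross_object_duplicates(incoming_leads, existing_leads, existing_contacts):
    """Return (incoming lead name, matched object) pairs, matching on email only.

    Hypothetical helper: checks two objects instead of one.
    """
    known = {}
    for contact in existing_contacts:
        key = normalize_email(contact.get("email"))
        if key:
            known[key] = "Contact"
    for lead in existing_leads:
        key = normalize_email(lead.get("email"))
        if key:
            # Keep "Contact" if the same email is already known as a contact.
            known.setdefault(key, "Lead")

    matches = []
    for lead in incoming_leads:
        key = normalize_email(lead.get("email"))
        if key in known:  # blank emails are never in `known`
            matches.append((lead["name"], known[key]))
    return matches
```

Running something like this against an export of leads and contacts before an import would flag the rows that a lead-only match would miss.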

Here are a few guidelines that can greatly increase data quality in your org:

  • Make sure your import files are clean (free of duplicates, properly formatted, etc.) prior to importing data.  Admittedly, this is not always possible or practical.
  • Use the leads object to store lower quality or unverified data.  Only leads that are qualified and of high data quality should be converted to accounts and contacts.
  • Train users to search for existing leads/contacts prior to creating a new lead/contact.
  • Train users to search for duplicate records prior to working with unverified data (e.g. web-to-lead submissions).
  • Use Data.com Duplicate Management (free as of Spring ’15) or third party tools to prevent duplicate records from being created.
  • Use required fields (either via field configuration or page layout), validation rules (look at the REGEX function for complex formatting requirements, such as phone numbers), filtered lookups, and other tools and features to ensure data is entered completely and formatted properly.
  • Cleanse existing dirty data with Data.com Clean (additional license fee as of Spring ’15), another third party tool, or manual cleanup.
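
As a rough illustration of the kind of format check a REGEX-based validation rule performs, here is a small Python sketch using the standard `re` module. The phone format `(555) 123-4567` is an assumption chosen for the example, and the function name is hypothetical; a real rule would use whatever format your org requires.

```python
import re

# Assumed format for illustration only: (555) 123-4567
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def phone_is_well_formed(phone):
    """True only when the whole value matches the assumed pattern."""
    return bool(PHONE_PATTERN.fullmatch(phone or ""))
```

The sketch uses `fullmatch`, so extra leading or trailing characters are rejected rather than ignored. The same function could also be used to screen an import file before loading it.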

Let’s look at a few examples:

| Symptom | Potential Cause | Potential Solutions |
|---|---|---|
| Duplicate lead records. | End user did not search for an existing lead prior to creating a new lead. | 1. Train users to search for duplicates prior to creating new leads. 2. Configure Data.com Duplicate Management Rules. 3. Use a third party solution. |
| Duplicate lead records. | Web-to-lead created a duplicate lead entry. | 1. Train users to search for duplicates prior to creating new leads. 2. Configure Data.com Duplicate Management Rules. 3. Use a third party solution. |
| Contact is duplicated as a lead record. | Lead import wizard did not match an existing contact with the same email address. | 1. Train users to search for duplicates on imported leads. 2. Configure Data.com Duplicate Management Rules. 3. Use a third party solution. |
| A lead record exists with no contact information. | Nothing prevents a user from creating a lead with no contact information. | 1. Make the email and/or phone field required on lead page layout(s). 2. Use a validation rule to ensure that either phone or email is populated. |
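
The "either phone or email" validation rule is easy to get backwards, because a validation rule's formula describes the error condition rather than the requirement. Here is a rough Python model of that logic. The function names are hypothetical, and `is_blank` only approximates how a blank field behaves. The rule should fire only when both fields are blank:

```python
def is_blank(value):
    """Approximate a blank field: None, empty, or whitespace only."""
    return value is None or value.strip() == ""


def rule_fires(phone, email):
    """Error condition: block the save only when phone AND email are both blank."""
    return is_blank(phone) and is_blank(email)


def truth_table():
    """List (phone populated, email populated, rule fires) for all four cases."""
    rows = []
    for phone in ("", "555-0100"):
        for email in ("", "a@example.com"):
            rows.append((bool(phone), bool(email), rule_fires(phone, email)))
    return rows
```

Swapping `and` for `or` in `rule_fires` would raise the error whenever either field is blank, which effectively makes both fields required.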

Manually Cleansing Duplicate Records

Salesforce provides several manual tools to merge duplicate records:

Leads

Find Duplicates (Button)

Accounts

Merge Accounts (from the account tab)

Contacts

Merge contacts button (from an account record)

Data.com Duplicate Management

Data.com Duplicate Management identifies duplicate records through the following:

Matching Rules specify which fields are evaluated to determine whether a record is a duplicate (e.g. First Name, Last Name, Email).  You can use the standard rules for some objects (lead, contact, account), or create custom rules if you need your own matching logic or want to reference custom objects.  Matching rules can specify fuzzy or exact field matches.

Standard Contact and Lead Matching Rule
[Should / Medium / Salesforce.com]

Understanding Matching Rules
[Should / 5m / Salesforce.com]
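
To illustrate the difference between exact and fuzzy field matching, here is a minimal Python sketch built on the standard library's `difflib.SequenceMatcher`. This is not the algorithm Salesforce uses. The 0.8 threshold, the field names, and the rule in `records_match` are all assumptions made up for the example.

```python
from difflib import SequenceMatcher


def exact_match(a, b):
    """Exact match after trimming whitespace and ignoring case."""
    return a.strip().lower() == b.strip().lower()


def fuzzy_match(a, b, threshold=0.8):
    """Fuzzy match: similarity ratio at or above an assumed threshold."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


def records_match(a, b):
    """Hypothetical matching rule: first name fuzzy, last name and email exact."""
    return (
        fuzzy_match(a["first_name"], b["first_name"])
        and exact_match(a["last_name"], b["last_name"])
        and exact_match(a["email"], b["email"])
    )
```

With these settings, "Jon" and "John" count as a fuzzy match but not an exact one, so two records that differ only in that spelling of the first name would still be flagged.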

Duplicate Rules allow the administrator to specify the matching rule(s) (above) that should be evaluated when a record is created or modified, and what should occur as a result (allow or block the action).

Managing Duplicate Records in Salesforce with Duplicate Rules
[Should / 3m / Salesforce.com]

Once activated, here is what an example duplicate rule looks like from an end user’s perspective.  In this example, the user is attempting to create a new lead, but Data.com has found 2 existing duplicate leads and 2 existing duplicate contacts and will block the creation of this record (the save button issues this error):

[Screenshot: duplicate rule error shown when the user tries to save the new lead]

To configure this rule, the administrator would create a new duplicate rule (the numbers below correspond to the screenshot):

  1. The object determines when the rule will be evaluated.  In this example, the rule will be evaluated when a lead is created or edited.
    • If we also wanted to prevent duplicate contacts from being created or edited, then we would create a second duplicate rule for the contact object.
  2. Determine what you want to occur when a duplicate is detected upon edit and creation: block or allow.  This example shows that we would block new duplicate leads from being created, but warn users when a duplicate is detected when an existing lead record is modified.
  3. Under matching rules, the administrator can specify which objects and matching rule(s) are used to identify duplicate records.  In this example, we are using the standard matching rule for leads and contacts.
  4. The matching rule specifies the individual fields that are compared and methodology for comparison (e.g. exact or fuzzy match).
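
The create/edit behavior described in step 2 can be sketched as a small Python model. The `DuplicateRule` class, its field names, and the message text are hypothetical; they mirror the settings described above rather than any Salesforce API.

```python
from dataclasses import dataclass


@dataclass
class DuplicateRule:
    """Hypothetical model of a duplicate rule's actions, not a Salesforce API."""
    object_name: str
    on_create: str  # "block" or "allow"
    on_edit: str    # "block" or "allow"
    alert_on_allow: bool = True


def evaluate(rule, operation, duplicates_found):
    """Return (saved, message) for a save attempt on the rule's object."""
    if not duplicates_found:
        return True, None
    action = rule.on_create if operation == "create" else rule.on_edit
    if action == "block":
        return False, f"Blocked: {duplicates_found} possible duplicate(s) found."
    if rule.alert_on_allow:
        return True, f"Warning: {duplicates_found} possible duplicate(s) found."
    return True, None


# Mirrors the example: block duplicates on create, warn on edit.
lead_rule = DuplicateRule("Lead", on_create="block", on_edit="allow")
```

Preventing duplicate contacts would mean a second `DuplicateRule("Contact", ...)` instance, just as a second duplicate rule is needed for the contact object.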

[Screenshot: duplicate rule configuration, with callouts 1 through 4]

Data Quality Tools (Third Party)

There is a wide range of products to manage data quality within Salesforce.  Here are a few of the more popular options:

| Name | Vendor | Type | Price | Description |
|---|---|---|---|---|
| Demand Tools | CRM Fusion | PC Application | Paid | Demand Tools is arguably the industry leading data cleansing tool for Salesforce. |
| People Import | CRM Fusion | PC Application | Paid | People Import is designed to import leads/contacts into Salesforce without creating duplicate records (matches against existing leads and contacts). |
| Dupe Blocker | CRM Fusion | AppExchange Package | Paid | Dupe Blocker blocks the creation of duplicate leads/contacts in Salesforce in real-time. |
| Various Products | RingLead | AppExchange Package(s) | Free/Paid | Whereas most other data quality software is scenario-driven (meaning that the administrator must define what qualifies as a duplicate), RingLead maintains a unique matching algorithm. This makes RingLead potentially a good option for an organization that wants something that "just works". |
| DupeCatcher | Symphonic Source | AppExchange Package | Free | DupeCatcher is a great free option for preventing duplicate lead/contacts in Salesforce in real-time. |
| Cloudingo | Symphonic Source | AppExchange Package | Paid | Cloudingo is a full, on-demand data quality suite. |

31 Responses to “Introduction to Data Quality”

  1. piyushsharma09 January 14, 2017 at 12:51 am #

    Hi john,

    If a lead comes in from the web, how will this duplicate rule work? Will the lead be rejected?

  2. jobzmpons@hotmail.com November 23, 2016 at 5:27 am #

    Hi John,

    Will the duplicate rule trigger if the user does not have access to the original record? I presume yes?

    • JohnCoppedge November 30, 2016 at 4:31 pm #

      That’s a really good question and I can’t seem to find an answer in the documentation.

      I did test this scenario out and found the following:

      -enabled the standard lead matching (matches lead to lead and lead to contact) duplicate check
      -created a duplicate lead with an admin account: got an error referencing the duplicate lead
      -created a duplicate lead with a user account: was able to create the lead (could not view the duplicate lead from this account)
      -went back to the admin account and edited the new duplicate lead. with no field changes, can save the record without triggering the duplicate rule
      -make a field change to the duplicate record (as an admin), and it will trigger the duplicate error.

      Definitely something to explore further if you’re implementing a data duplication prevention strategy.

  3. CarlosSiqueira June 20, 2016 at 11:22 pm #

    John:

    I created a duplicate rule for Contacts and another for Account, exactly as shown above.
    I can log in as myself (Admin) or as a regular user and am still able to create a duplicate new record or change an existing one, for both Contacts and Accounts. What am I missing? Both duplicate rules are activated.

    Thanks

    • JohnCoppedge June 22, 2016 at 1:04 am #

      Are you populating the entire record? Check the details of the standard matching algorithm; you may not be triggering a match.

      • CarlosSiqueira June 22, 2016 at 2:22 am #

        Oops, I tried to create a new record by typing only 1st and last name.
        Now that you mentioned, I just took an existing Contact, tried to clone it and got the warning. Thanks a bunch!

  4. Andrew DeSanctis September 17, 2015 at 12:58 am #

    FYI….took the exam this morning. Killed.

    Thanks for everything. Get it together on the new developer stuff dude!

    • JohnCoppedge September 21, 2015 at 8:24 pm #

      Awesome congrats Andrew!

      Yeah, the Lightning UI and related stuff will definitely make it into the site once released; it's not actually released yet (Winter 16 is around the corner). Glad the site helped!

    • mayousaf July 21, 2016 at 6:00 am #

      Are there a lot of multiple choice questions on validation and data quality in the exam? I'm taking my exam on the 26th.

      • CarlosSiqueira July 21, 2016 at 6:45 pm #

        I passed last June 28th and don’t recall ANY questions like that. Keep in mind that SF has a “bank” of over 1000 questions and you can get a very tough one like I had on May 18th. Good luck and let us know.

  5. Andrew DeSanctis September 15, 2015 at 7:18 pm #

    The “train users to search for existing records” guideline is really only as good (bad?) as the record level security they have and/or role hierarchy enforcement. Right? If they can’t read or see the record, it won’t turn up in search.

  6. Rena Bennett-Dellwo April 6, 2015 at 11:42 pm #

    A little edit:

    “Only lead that are qualified…”

    should read

    “Only leads that are qualified…” or “Only lead data that are [or is, depending on your singular/plural preference 🙂 ] qualified…”

  7. Kevin Brown March 9, 2015 at 7:50 pm #

    Needs editing:

    Data quality refers the usability and accuracy of data

    should read:

    Data quality refers to the usability and accuracy of data

  8. Puja Parikh November 20, 2014 at 7:00 am #

    Hi John,

    You have done a great job by providing this content, its extremely useful in my ADM201 exam preparation. After reading this material i have gained confidence in my SF knowledge. I have my test on 21st Nov, i look forward to the results .:P

    BTW What would you recommend from two options for Data Quality Tool- RIngLead and Demand Tool?

    Puja

    • JohnCoppedge November 20, 2014 at 10:45 pm #

      Cloudingo is another option that is growing in popularity from what I’ve seen. I actually haven’t used any of them extensively myself so I’m afraid I can’t provide a recommendation.

      Good luck on the exam!

  9. Juul Schobbers November 5, 2014 at 9:58 am #

    “DupeCatcher is a great free option for preventing duplicate lead/contacts in Salesforce in real-time”

    To my knowledge, DupeCatcher does not prevent duplicate leads/contacts in real time. You need to save the record first before you get a dupe message. I thought that only RingLead (paid version) shows a dupe message while you are typing (real time).

    • JohnCoppedge November 5, 2014 at 7:53 pm #

      Interesting point of clarification – “real time” in this case is intended to mean that it prevents the record from being saved, rather than allowing a duplicate to be saved and then performing cleanup after the fact. Agreed, RingLead is the only solution I’ve seen that includes a type-ahead style real time duplicate finder.

  10. Jeanne Busch October 23, 2014 at 5:53 pm #

    minor typo: “Train users to search for duplicate records as prior to working with unverified data (e.g. web-to-lead submissions).”

    remove the “as”.

  11. Vinay M June 27, 2014 at 4:19 pm #

    Thanks roger ! very helpful.

  12. Roger Grilo March 15, 2014 at 9:35 am #

    “Use a validation rule to ensure that either phone or email is populated.”

    —-> an interesting exercise!
    to accomplish this, one possible solution is:
    1. Go to App Setup > Leads > Validation Rules
    2. Click New
    3. Rule Name: “Require_either_email_or_phone”
    4. Check ‘Active’
    5. Enter ‘Error Condition Formula’:
    AND(
    ISBLANK(Phone),
    ISBLANK(Email))
    6. Type on ‘Error Message’:
    “Either a phone number or an email address is required for every lead!”
    7. Click Save
    8. Test on any Lead

    • Davin Casey September 23, 2014 at 8:20 am #

      A minor point, but the formula should use the OR() function, not AND().

      • Scott Waddell September 25, 2014 at 1:52 pm #

        If you use OR, then you’ll get an error if EITHER phone OR email is BLANK. The rule we’re trying to build in this example should only error if BOTH phone AND email are BLANK. So, Roger’s formula is correct.

        Boolean logic is such fun! 🙂

        • Jim Mitchell October 11, 2014 at 5:28 pm #

          Roger’s use case specifically says “or” in the requirement. To me, that would be an OR() function in the error condition formula.

          Otherwise, requirement, and error message, should both be “Both a phone number and an email is required for every lead!”

          Boolean logic is either fun, or not, but it’s never both. 🙂

          • JohnCoppedge October 19, 2014 at 2:22 am #

            I think Scott is right on this one – if you used an OR statement, then the validation rule would fire every time EITHER email OR phone was blank. In short it would make both fields required – which you could just as easily accomplish by making them required on the page layout. The AND actually makes it an OR in a practical sense. Counter intuitive eh 🙂

    • Nithya Gopinath September 3, 2015 at 12:16 am #

      Awesome!!..i just tested in my org..good to know…