College of Computing and School of Public Policy, Georgia Institute of Technology
Accuracy, runtime, and technical complexity
Choice to prioritize:
In most situations, we want as many matches as possible and most of them need to be correct.
An ideal algorithm should be precise yet invariant to minor changes in data format.
For instance, some data may contain street postfix abbreviations (e.g. “STREET” or “ST”) within the owner address whereas others will not.
For practitioners, waiting days for an algorithm to run is not feasible, but it may be acceptable for researchers.
Access to computational hardware is also a consideration.
Categories for Runtime
The marginal difference from category 2 to 3 is likely not worth the effort.
How difficult is it for a practitioner or researcher to reproduce the method?
This issue can be prevented with the creation of programmatic tools or applications that facilitate the use of these algorithms.
Let us further analyze algorithm efficiency, otherwise known as time and space complexity. This will help us compare potential approaches later.
We can categorize algorithms with Big-O notation.
We write time complexity as O(f(n)), where f(n) is an upper-bound function for the runtime of the algorithm given an input of size n.
Runtime is not how long an algorithm takes to run on any individual computer, but rather the number of operations in the worst case.
We focus on how quickly the runtime of an algorithm grows with respect to its input size.
These comparisons are the building blocks of entity resolution algorithms
O(1)
This is an exact comparison between two encoded values (in this case, strings). A computer can take two binary encodings from memory and compare them exactly with a single operation.
Example
Both "A" =? "A" and "1 MAIN ST" =? "1 MAIN ST" take approximately the same time to execute, despite the longer strings in the second comparison.
O(n)
These are metrics that calculate a similarity score based on aspects of two strings. Many string distance metrics calculate the “distance” between corresponding letters in each string. This requires a comparison for each letter in the shortest string.
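Example
Running Hamming distance, a position-by-position string distance metric, on "A" =? "A" and "1 MAIN ST" =? "1 MAIN ST" takes 1 and 9 comparisons, respectively. Levenshtein distance, a more common metric, is costlier: its standard dynamic-programming form takes on the order of m × n operations for strings of lengths m and n.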
O(n)
These functions utilize the same efficiencies of vectorized comparison by hashing values and placing them into blocks/buckets.
Example (Python Pandas)
data.groupby("owner_address")
O(n^2)
These functions typically compare each string to every other string using a similarity metric, grouping strings that are above a certain threshold.
Runtime may be improved with blocking or concurrency; OpenRefine’s clustering function appears to use an optimized implementation.
Example (OpenRefine)
Please refer to An et al. 2024
O(n^2)
Finds the k most similar vectors.
<= O(n)
Finds the k most similar vectors, but trades the guarantee of 100% accuracy for speed.
Depends
Neural network trained on labeled data.
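The difference between the O(n) and O(n^2) rows above can be made concrete with a minimal sketch. It uses only the Python standard library; the sample addresses, the 0.8 threshold, and the choice of `difflib.SequenceMatcher` as the similarity metric are illustrative assumptions, not the methods benchmarked here.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample of owner addresses
addresses = [
    "1 MAIN ST",
    "1 MAIN ST",
    "1 MAIN STREET",
    "52 CREEKSIDE PARK",
]

# O(n): a single pass that hashes each address into a bucket.
# Only exact duplicates land in the same bucket.
buckets = defaultdict(list)
for i, addr in enumerate(addresses):
    buckets[addr].append(i)
exact_groups = [ids for ids in buckets.values() if len(ids) > 1]

# O(n^2): every pair of addresses is scored with a similarity metric.
threshold = 0.8  # assumed cutoff, chosen only for illustration
pairs_checked = 0
fuzzy_pairs = []
for (i, a), (j, b) in combinations(enumerate(addresses), 2):
    pairs_checked += 1
    if SequenceMatcher(None, a, b).ratio() >= threshold:
        fuzzy_pairs.append((i, j))

print(exact_groups)   # groups of exact duplicates
print(pairs_checked)  # n * (n - 1) / 2 comparisons
print(fuzzy_pairs)    # pairs at or above the threshold
```

With four addresses the pairwise loop performs six comparisons; with n addresses it performs n(n - 1)/2, which is why the pairwise approach becomes slow on large parcel datasets while the bucket approach does not.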
In Uncovering Neighborhood-level Portfolios of Corporate Single-Family Rental Holdings and Equity Loss, our priority was to:
It was not intended to identify specific corporate portfolios, but this is possible with an additional querying method.
Modified version of owner address
Runtime complexity of O(n) due to vectorized address key.
Table 1. Examples of Owner Address Key Procedure
Address Number | Address String | Zip Code | Address Key |
---|---|---|---|
52 | CREEKSIDE PARK | 30022 | 52 CREEKSIDE PARK 30022 |
NA (filled with 0) | P.O. BOX 370049 | 30037 | 0 PO BOX 370049 30037 |
NA (filled with 0) | P O Box 370049 | 30037 | 0 PO BOX 370049 30037 |
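The rows of Table 1 suggest how such a key might be assembled. The sketch below is one possible reading, not the published procedure: the function name `simple_address_key` and the specific cleaning rules (uppercasing, dropping punctuation, collapsing "P O BOX" to "PO BOX") are assumptions chosen to reproduce the three rows shown.

```python
import re

def simple_address_key(number, street, zip_code):
    """Build a simple owner address key (hypothetical helper)."""
    # A missing address number is filled with 0, as in Table 1
    num = "0" if number is None else str(number)
    # Assumed cleaning: uppercase, drop punctuation, collapse whitespace
    s = street.upper()
    s = re.sub(r"[^\w\s]", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    # Assumed rule: "P O BOX" and "PO BOX" are treated as the same string
    s = re.sub(r"^P O BOX\b", "PO BOX", s)
    return f"{num} {s} {zip_code}"

print(simple_address_key(52, "CREEKSIDE PARK", "30022"))
print(simple_address_key(None, "P.O. BOX 370049", "30037"))
print(simple_address_key(None, "P O Box 370049", "30037"))
```

Because the key is a plain string, owners can then be grouped on it with an exact, hash-based comparison.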
Table 2. Sample of Corporate Owners and Associated Subsidiaries
Owner Address Key | Sample of Associated Names | Common Name |
---|---|---|
5001 PLAZA ON THE 78746 | ALTO ASSET COMPANY 2 LLC, EPH 2 ASSETS LLC, BAF 1 LLC | Amherst Residential |
1850 PARKWAY 30067 | FKH SFR PROPCO D L P, CERBERUS SFR HOLDINGS LP | FirstKey Homes (Cerberus Capital) |
PO BOX 4090 85261 | HOME SFR BORROWER IV LLC, PROGRESS RESIDENTIAL BORROWER 15 LLC | Progress Residential |
1717 MAIN 75201 | 2018 3 IH BORROWER LP, 2018 2 IH BORROWER LP | Invitation Homes |
591 PUTNAM 6830 | STAR 2021-SFR2 BORROWER L P, STAR 2021 SFR2 BORROWER LP | Colony Starwood Homes |
We utilized this method to create a straightforward framework quantifying equity loss from communities due to institutional investors.
Analyzed an extensive dataset of parcel and sale records from 2010 to 2022, revealing:
Due to the possibility that the same corporation may use multiple addresses, the address key method is not intended to robustly describe the portfolios of specific corporate owners.
However, specific portfolios can still be assembled by querying the address keys.
Querying Steps
address key
Code Example (Python)
import re

# `owner_scale` is assumed to be an existing pandas DataFrame with the
# columns TAXYR, assoc_owner_names, and count_owned_fulton_yr.

# Keywords to query for each owner.
# For short strings like "IH" that might accidentally appear inside
# other names, a trailing space is added to reduce this possibility.
owner_keywords = {
    "Amherst": ["AMHERST", "ARVM"],
    "Cerberus": ["CERBERUS", "FKH", "RM1 ", "RMI "],
    "Progress": ["PROGRESS", "FYR"],
    "Invitation": ["INVITATION", "IH "],
    "Colony": ["COLONY", "STARWOOD", "CSH", "CAH "],
    "Sylvan": ["SYLVAN", "RNTR"],
    "Tricon": ["TRICON", "TAH"],
}

# Query for each owner described above
for owner in owner_keywords:
    query_str = "|".join(owner_keywords[owner])

    # Find rows for TAXYR 2020 where at least one associated name
    # contains a keyword matching the owner
    owned_by_given_corp = owner_scale[
        (owner_scale["TAXYR"] == 2020)
        & owner_scale["assoc_owner_names"].apply(
            lambda names: any(re.search(query_str, name) for name in names)
        )
    ]

    # Only retain matched subsidiary owners over a threshold to prevent
    # false positive matches, for instance if a single parcel was owned
    # by someone with the last name "SYLVAN"
    owned_by_given_corp = owned_by_given_corp[
        owned_by_given_corp["count_owned_fulton_yr"] > 49
    ]

    # Sum total over all matched subsidiaries
    total_owned = owned_by_given_corp["count_owned_fulton_yr"].sum()
Table 3. SFR Ownership Scale of Top 5 Corporate Landlords in Fulton County in 2020
Name | Parcels Owned in Fulton (Address Key Query) | OpenRefine (OR) method (An et al. 2024) | Net Diff. vs. OR | OR method + manual review (An et al. 2024) | Net Diff. vs. OR + manual review |
---|---|---|---|---|---|
Amherst Residential | 720 | 103 | +617 | 750 | -30 |
Invitation Homes | 677 | 524 | +153 | 719 | -42 |
Progress Residential | 619 | 457 | +159 | 760 | -141 |
Sylvan Realty (RNTR) | 422 | 250 | +172 | 433 | -11 |
Tricon Residential | 219 | 222 | -3 | 280 | -61 |
Cerberus Capital | 256 | 340 | -86 | 349 | -93 |
Starwood Capital | 122 | 361 | -239 | 450 | -328 |
Issues with Simple Address Key:
The simplest form of an Advanced Address Key extracts and concatenates:
These features are almost always invariant to changes in spelling, word order, street postfixes, or other inconsistencies.
Table 4. Example Advanced Address Key Matching
Address 1 | Zip 1 | ADDR KEY 1 | Address 2 | Zip 2 | ADDR KEY 2 | Matched? |
---|---|---|---|---|---|---|
PO BOX 490734 | 30363 | 490734-0-OX30363 | P O. BOX 490734 | 30363 | 490734-0-OX30363 | Yes |
3505 KOGER BLVD 400 | 30315 | 3505-400-ER30315 | 3505 KOGER BLVD., SUITE 400 | 30315 | 3505-400-ER30315 | Yes |
One Buckhead PL STE 300 | 30305 | 1-300-AD-30305 | One Buckhead PL STE 325 | 30303 | 1-325-AD-30305 | No |
5 PEACHTREE ST | 30308 | 5-0-30308 | 5 PEACHTREE ST | 30354 | 5-0-30354 | No |
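One way to read the keys in Table 4 is sketched below. The exact feature list is not reproduced here, so every rule in this sketch is an assumption inferred from the first two rows of the table: the first number in the address, the second number (or 0 if absent), the last two letters of the longest word, and the zip code. The name `advanced_address_key` and the `NUMBER_WORDS` lookup are hypothetical.

```python
import re

# Assumption: only small spelled-out numbers need converting ("One" -> "1")
NUMBER_WORDS = {
    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}

def advanced_address_key(address, zip_code):
    """Build a format-tolerant address key (hypothetical rules)."""
    # Uppercase and replace punctuation with spaces, then split into tokens
    tokens = re.sub(r"[^\w\s]", " ", address.upper()).split()
    tokens = [NUMBER_WORDS.get(tok, tok) for tok in tokens]

    numbers = [tok for tok in tokens if tok.isdigit()]
    words = [tok for tok in tokens if tok.isalpha()]

    first_number = numbers[0] if numbers else "0"
    second_number = numbers[1] if len(numbers) > 1 else "0"
    # Assumed rule: last two letters of the longest word (first one on ties)
    longest_word = max(words, key=len) if words else ""
    return f"{first_number}-{second_number}-{longest_word[-2:]}{zip_code}"

print(advanced_address_key("PO BOX 490734", "30363"))
print(advanced_address_key("P O. BOX 490734", "30363"))
print(advanced_address_key("3505 KOGER BLVD 400", "30315"))
print(advanced_address_key("3505 KOGER BLVD., SUITE 400", "30315"))
```

Under these assumed rules, both spellings of the PO box produce the same key, as do both spellings of the Koger Boulevard address, while a different suite number or zip code produces a different key.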
Not included by default since this data may vary by state or be inaccessible.
Regardless, considering data inconsistency, matching even good data from business registry records to parcel or sales records requires the use of fuzzy matching algorithms discussed here.
Using an address key querying approach, we’ve demonstrated how to construct a list of owners associated with a business address.
If researchers complete this step within their own metro areas, we can combine this data to create a robust, crowdsourced ownership database.
This database may consist of two tables, one linking address keys to an owner index, and another linking owner indexes to all associated corporate names.
Table 5. Example Schema of Ownership Database (Addresses)
Address Key | Owner Index |
---|---|
5001 PLAZA ON THE 78746 | 1 |
9800 HILLWOOD 76177 | 2 |
1850 PARKWAY 30067 | 2 |
Table 6. Example Schema of Ownership Database (Associated Names)
Owner Index | Associated Names | Common Name |
---|---|---|
1 | ALTO ASSET COMPANY 2 LLC, EPH 2 ASSETS LLC, BAF 1 LLC | Amherst Residential |
2 | FKH SFR PROPCO D L P, CERBERUS SFR HOLDINGS LP | FirstKey Homes |
Appendix C of Polimeni and An 2024 contains this data for Fulton County, GA.
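The two-table layout in Tables 5 and 6 can be sketched with Python's built-in `sqlite3` module. The table and column names (`address_keys`, `owner_names`, and so on) and the helper `names_for_address_key` are assumptions made for illustration, and storing one associated name per row is a design choice of this sketch rather than something specified above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address_keys (
    address_key TEXT PRIMARY KEY,
    owner_index INTEGER NOT NULL
);
CREATE TABLE owner_names (
    owner_index INTEGER NOT NULL,
    associated_name TEXT NOT NULL,
    common_name TEXT NOT NULL
);
""")

# Rows taken from Table 5
conn.executemany("INSERT INTO address_keys VALUES (?, ?)", [
    ("5001 PLAZA ON THE 78746", 1),
    ("9800 HILLWOOD 76177", 2),
    ("1850 PARKWAY 30067", 2),
])

# Rows taken from Table 6, split into one associated name per row
conn.executemany("INSERT INTO owner_names VALUES (?, ?, ?)", [
    (1, "ALTO ASSET COMPANY 2 LLC", "Amherst Residential"),
    (1, "EPH 2 ASSETS LLC", "Amherst Residential"),
    (1, "BAF 1 LLC", "Amherst Residential"),
    (2, "FKH SFR PROPCO D L P", "FirstKey Homes"),
    (2, "CERBERUS SFR HOLDINGS LP", "FirstKey Homes"),
])

def names_for_address_key(conn, address_key):
    """Return (common_name, associated_name) rows for one address key."""
    return conn.execute(
        """
        SELECT n.common_name, n.associated_name
        FROM address_keys AS a
        JOIN owner_names AS n ON n.owner_index = a.owner_index
        WHERE a.address_key = ?
        ORDER BY n.associated_name
        """,
        (address_key,),
    ).fetchall()

result = names_for_address_key(conn, "1850 PARKWAY 30067")
print(result)
```

Because several address keys can share one owner index, a lookup on either "9800 HILLWOOD 76177" or "1850 PARKWAY 30067" resolves to the same set of associated names.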
For instance, “2018 3 IH BORROWER LP” and “HOME SFR BORROWER IV LLC” have many letters in common, but they are different entities.
Clustering requires that the data can be vectorized. We can encode owner name strings into vectors using a string embedding model. However, these models are generally intended to encode semantic meaning. They are not useful for identifying strings that look similar.
Deep learning requires a correct, labeled dataset to train on. This takes additional effort to create. Furthermore, there are many instances of subsidiary corporations with no discernable pattern.
Ultimately, the problem is solvable from a computational perspective (there exists a polynomial time algorithm), but it is not perfectly solvable from a data perspective.
An impossible problem: when owner names and owner addresses are different but they represent the same corporation.
For instance, how is it possible that any algorithm or model can correctly classify “Jeff 1 LLC” as a subsidiary of Amherst Capital if the owner address is different?
There are no string distance metrics that will resolve this issue and no discernable pattern for deep learning.
It is not worth the time to aim for a perfect solution.
The best solution is likely an optimized fuzzy address key, coupled with additional data steps to aggregate specific portfolios.
This solution finds an ideal balance between accuracy, runtime, and implementation complexity. Additionally, it generally avoids false positives, which is ideal for most purposes.
To reduce the burden on researchers and increase the speed at which the impact of SFR investment can be studied, a methods paper should concretely compare all current methods.
The goal of this speculative project is to produce:
More information at: nicholaspolimeni.com/posts/methods-sfr/